Linux Bash count and summarize by unique columns - bash

I have a text file with lines like this (in Linux Bash):
A B C D
A B C J
E B C P
E F G N
E F G P
A B C Q
H F S L
G Y F Q
H F S L
I need to find the lines with unique values in the first 3 columns, print their count, and then print the last-column values joined together for each unique combination, so the result looks like this:
3 A B C D,J,Q
1 E B C P
2 E F G N,P
1 G Y F Q
2 H F S L
What I have tried:
cat FILE | sort -k1,3 | uniq -f3 -c | sort -k3,5nr
Does anyone have any advice?
Thanks in advance!

The easiest approach is the following:
awk '{key=$1 OFS $2 OFS $3; a[key]=a[key]","$4; c[key]++}
END{for(key in a) { print c[key],key,substr(a[key],2) }}' <file>
If you do not want any duplicates in the joined last column, you can do:
awk '{ key=$1 OFS $2 OFS $3; c[key]++ }
!gsub(","$4,","$4,a[key]) {a[key]=a[key]","$4; }
END{for(key in a) { print c[key],key,substr(a[key],2) }}' <file>
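Note that for (key in a) iterates in an unspecified order. To reproduce the sorted ordering shown in the question, you can pipe either command through sort, for example (a sketch, with the input in file):
awk '{key=$1 OFS $2 OFS $3; a[key]=a[key]","$4; c[key]++}
END{for(key in a) { print c[key],key,substr(a[key],2) }}' file | sort -k2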

Could you please try the following and let me know if it helps you?
This will print the output in the same order in which each $1, $2, $3 combination first occurs in Input_file.
awk '
!a[$1,$2,$3]++{                                    # first time this key is seen: remember its order
  b[++count]=$1 FS $2 FS $3
}
{
  c[$1,$2,$3]=c[$1,$2,$3]?c[$1,$2,$3] "," $4:$0    # first hit stores the whole line, later hits append $4
  d[$1 FS $2 FS $3]++                              # count occurrences of the key
}
END{
  for(i=1;i<=count;i++){
    print d[b[i]],c[b[i]]
  }
}
' SUBSEP=" " Input_file

Another option, using GNU awk and 2D arrays to remove duplicates in $4:
$ awk '{
  i=$1 OFS $2 OFS $3                      # key to hash
  a[i][$4]                                # store each $4 as a separate element
  c[i]++                                  # count key references
}
END {
  for(i in a) {
    k=1                                   # comma counter for output
    printf "%s %s ",c[i],i                # output count and key
    for(j in a[i])                        # each a[i][j] element
      printf "%s%s",((k++)==1?"":","),j   # output commas and elements
    print ""                              # line ending
  }
}' file
Output, in awk's default (unspecified) hash order:
2 E F G N,P
3 A B C Q,D,J
1 G Y F Q
1 E B C P
2 H F S L
Since we are using GNU awk, the order of the output can easily be controlled by setting PROCINFO["sorted_in"]="@ind_str_asc".
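The assignment goes at the top of the END block, for example (a minimal sketch; the rest of the script is unchanged):
END {
  PROCINFO["sorted_in"]="@ind_str_asc"    # iterate for(i in a) keys in ascending string order
  for(i in a) {
    # ... same loop body as above ...
  }
}
which gives: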
3 A B C D,J,Q
1 E B C P
2 E F G N,P
1 G Y F Q
2 H F S L

You could utilize GNU datamash:
$ cat input
A B C D
A B C J
E B C P
E F G N
E F G P
A B C Q
H F S L
G Y F Q
H F S L
$ datamash -t' ' --sort groupby 1,2,3 unique 4 count 4 < input
A B C D,J,Q 3
E B C P 1
E F G N,P 2
G Y F Q 1
H F S L 2
This unfortunately outputs the count as the last column. If it is absolutely necessary for it to be the first column, you will have to reformat it, for example by prepending the last field with awk and then deleting the original (decrementing NF to drop a field works in GNU awk; POSIX leaves it undefined):
$ datamash -t' ' --sort groupby 1,2,3 unique 4 count 4 < input | awk '{$0=$NF FS $0; NF--}1'
3 A B C D,J,Q
1 E B C P
2 E F G N,P
1 G Y F Q
2 H F S L

Related

converting four columns to two using linux commands

I am wondering how one could merge four columns into two in the following manner (using the awk command, or other possible commands).
For example,
Old:
A B C D
E F G H
I J K L
M N O P
.
.
.
New:
A B
C D
E F
G H
I J
K L
M N
O P
.
.
Thanks so much!
That's actually quite easy with awk, as per the following transcript:
pax> cat inputFile
A B C D
E F G H
pax> awk '{printf "%s %s\n%s %s\n", $1, $2, $3, $4}' <inputFile
A B
C D
E F
G H
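If the rows can have more than four columns, a small generalisation prints every consecutive pair of fields instead (a sketch, assuming an even number of fields per line):
awk '{for(i=1; i<NF; i+=2) print $i, $(i+1)}' inputFile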
How about using xargs here? Could you please try the following once.
xargs -n 2 < Input_file
Output will be as follows.
A B
C D
E F
G H
I J
K L
M N
O P
With GNU sed:
$ sed 's/ /\n/2' file
This replaces the 2nd space on each line with a newline.
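A portable alternative is to put every field on its own line and re-join them in pairs with paste (a sketch, assuming single-space-delimited fields):
tr ' ' '\n' < file | paste -d' ' - -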

How to merge rows in a file based on common fields using awk?

I have a large tab delimited two column file that has the coordinates of many biochemical pathways like this:
A B
B D
D F
F G
G I
A C
C P
P R
A M
M L
L X
I want to combine the lines if column 1 in one line is equal to column 2 in another line resulting in the following output:
A B D F G I
B D F G I
D F G I
F G I
G I
A C P R
C P R
P R
A M L X
M L X
L X
I would like to use something simple such as an awk one-liner; does anyone have an idea how I would approach this without writing a shell script? Any help is appreciated. I am trying to get each step and each subsequent step in each pathway. As these pathways often intersect, some steps are shared by other pathways, but I want to analyse each one separately.
I have tried a shell script where I try to grep out any column where $2 = $1 later in the file:
while [ -s test ]; do
    grep -m1 "^" test > i
    cut -f2 i | sed 's/^/"/' | sed 's/$/"/' | sed "s/^/awk \'\$1 == /" | sed "s/$/' test >> i/" > i.sh
    sh i.sh
    perl -p -e 's/\n/\t/g' i >> OUT
    sed '1d' test > i ; mv i test
done
I know that my problem comes from (a) deleting the line and (b) the fact that there are duplicates. I am just not sure how to tackle this.
Input
$ cat f
A B
B D
D F
F G
G I
A C
C P
P R
A M
M L
L X
Command and output:
$ awk '{
  for(j=1; j<=NF; j+=2) {                        # start a chain at every odd field
    for(i=j; i<=NF; i+=2) {
      printf("%s%s", i==j ? $i OFS : OFS, $(i+1));
      if($(i+1)!=$(i+2)) { print ""; break }     # stop when the next pair no longer continues the chain
    }
  }
}' RS= OFS="\t" f                                # RS= (paragraph mode) reads the blank-line-free file as one record
A B D F G I
B D F G I
D F G I
F G I
G I
A C P R
C P R
P R
A M L X
M L X
L X
One liner
awk '{ for(j=1; j<=NF; j+=2)for(i=j;i<=NF;i+=2){printf("%s%s", i==j ? $i OFS : OFS,$(i+1)); if($(i+1)!=$(i+2)){ print ""; break }}}' RS= OFS="\t" f
Well, you could put this on one line, but I wouldn't recommend it :)
#!/usr/bin/awk -f
{
    a[NR] = $0
    for(i = 1; i < NR; i++){
        if(a[i] ~ $1"$")
            a[i] = a[i] FS $2
        if(a[i] ~ "^"$1){
            for(j = i; j < NR; j++){
                print a[j]
                delete a[j]
            }
        }
    }
}
END{
    for(i = 1; i <= NR; i++)
        if(a[i] != "")
            print a[i]
}
Reversing the file with tac lets each line pick up the already-extended tail of the line that follows it in the original order; the second tac restores that order:
$ <f.txt tac | awk 'BEGIN{OFS="\t"}{if($2==c1){$2=$2"\t"c2};print $1,$2;c1=$1;c2=$2}' | tac
A B D F G I
B D F G I
D F G I
F G I
G I
A C P R
C P R
P R
A M L X
M L X
L X

Merging two outputs in shell script

I have the output of 2 commands like this:
Output of the first command:
A B
C D
E F
G H
Output of the second command:
I J
K L
M B
I want to merge both outputs, and if a value in the second column is the same in both outputs, I'll take the entry from the 1st output.
So, my output should be:
A B
C D
E F
G H
I J
K L
// not taking (M B), since B is already there in the first entry (A B), giving preference to the first output
Can I do this using a shell script? Is there a command for it?
You can use awk. While reading file1 (FNR==NR), record each $2 and print the line; then for file2, print only lines whose $2 was not seen in file1:
awk 'FNR==NR{a[$2];print;next} !($2 in a)' file1 file2
A B
C D
E F
G H
I J
K L
If the order of entries is not important, you can sort on the 2nd column and uniquefy:
sort -u -k2 file1 file2
Both -u and -k are specified in the POSIX standard
This wouldn't work if there are repeated entries in the 2nd column of file1.
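If it is acceptable to simply keep the first line seen for each 2nd-column value (file1's lines are read first, so they win; note this also drops later duplicates within file1 itself), a one-liner sketch that preserves order:
awk '!seen[$2]++' file1 file2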

Search for a column by name in awk

I have a file that has many columns. Let us say "Employee_number", "Employee_name", and "Salary". I want to display all entries in a column by giving all or part of the column name. For example, if my input is "name", I want all the employee names printed. Is it possible to do this in a simple manner using awk?
Thanks
Given a script getcol.awk as follows:
BEGIN {
    colname = ARGV[1]       # first argument is the (partial) column name
    ARGV[1] = ""            # blank it so awk does not treat it as an input file
    getline                 # read the header line
    for (i = 1; i <= NF; i++) {
        if ($i ~ colname) {
            break;
        }
    }
    if (i > NF) exit        # no column matched the name
}
{print $i}
... and the input file test.txt:
apple banana candy deer elephant
A B C D E
A B C D E
A B C D E
A B C D E
A B C D E
A B C D E
A B C D E
... the command:
$ awk -f getcol.awk b <test.txt
... gives the following output:
B
B
B
B
B
B
B
Note that the output text does not include the first line of the test file, which is treated as a header.
A simple one-liner will do the trick:
$ cat file
a b c
1 2 3
1 2 3
1 2 3
$ awk -v c="a" 'NR==1{for(i=1;i<=NF;i++)n=$i~c?i:n;next}n{print $n}' file
1
1
1
$ awk -v c="b" 'NR==1{for(i=1;i<=NF;i++)n=$i~c?i:n;next}n{print $n}' file
2
2
2
$ awk -v c="c" 'NR==1{for(i=1;i<=NF;i++)n=$i~c?i:n;next}n{print $n}' file
3
3
3
# no column d so no output
$ awk -v c="d" 'NR==1{for(i=1;i<=NF;i++)n=$i~c?i:n;next}n{print $n}' file
Note: as per your requirement, name matches Employee_name; just be aware that if you give employee you will get the last column whose header matches employee. This is easily changed, however.
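For example, to take the first matching column instead of the last, stop scanning the header at the first hit (a sketch of the same one-liner; here c="name" stands in for the question's column pattern):
$ awk -v c="name" 'NR==1{for(i=1;i<=NF;i++)if($i~c){n=i;break};next}n{print $n}' file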

File Manipulation Loop

OK, this question is two-fold: 1) the actual file manipulation bit, and 2) looping this manipulation in Unix.
Part 1)
I have two files:
File_1
a b
c d
e f
g h
and File_2
A B
C D
E F
G H
I J
I would like to get (in the first instance) the following result:
a b
A B
>
c d
A B
>
e f
A B
>
g h
A B
...and save this output to outfile1.
I gather I would have to use things like awk, cut and/or paste but I can't manage to put it all together.
Part 2)
I then want to loop this manipulation for all rows in File_2 (note that the number of rows in File_1 is not the same as in File_2), such that I end up with 5 output files, where outfile2 would be:
a b
C D
>
c d
C D
>
e f
C D
>
g h
C D
and outfile3 would be:
a b
E F
>
c d
E F
>
e f
E F
>
g h
E F
etc.
At the moment I'm working in bash. Thank you in advance for any help!
This can be done with bash redirection:
i=1
while read f2; do
    while read f1; do
        echo "$f1"
        echo "$f2"
        echo ">"
    done < File_1 | head -n -1 > output$i
    (( i++ ))
done < File_2
head -n -1 avoids having a lone delimiter at the end of each output$i file.
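If head -n -1 is not available (negative line counts are a GNU coreutils extension), a sketch that instead prints the delimiter before every pair except the first:
i=1
while read f2; do
    first=1
    while read f1; do
        [ "$first" -eq 1 ] || echo ">"
        echo "$f1"
        echo "$f2"
        first=0
    done < File_1 > output$i
    (( i++ ))
done < File_2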
To make outfile_3 (with E F), for example:
x=$(sed -n '3p' File_2)
awk "{ printf \"%s\\n%s\\n>\\n\", \$0, \"$x\" }" File_1 > outfile_3
In the first line, '3p' prints the 3rd line of File_2.
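A slightly safer variant passes the line in with -v instead of interpolating it into the awk program text (a sketch):
x=$(sed -n '3p' File_2)
awk -v x="$x" '{ printf "%s\n%s\n>\n", $0, x }' File_1 > outfile_3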
Now let's do it in a loop:
(( i = 1 ))
while read line
do
    awk "{ printf \"%s\\n%s\\n>\\n\", \$0, \"$line\" }" File_1 > "outfile_$i"
    (( i++ ))
done < File_2
sort -m -f file1 file2 | uniq -i --all-repeated=separate
looked rather close. However, on second reading, I think you want something more like this Perl script:
use strict;
use warnings;
open(my $FILE1, '<file1') or die;
my $output = 0;
while (my $a = <$FILE1>)
{
    $output++;
    open(my $OUT, ">output$output");
    open(my $FILE2, '<file2') or die;
    print $OUT "$a$_---\n" foreach (<$FILE2>);
    close $FILE2;
    close $OUT;
}
close $FILE1;
This creates output files output1, output2, output3, ..., as many as there are lines in file1.
An awk one-liner:
awk 'NR==FNR{a[NR]=$0;l=NR;next;} {b[FNR]=$0;}
END{f=1; for(x=1;x<=FNR;x++){for(i=1;i<=length(a);i++){
printf "%s\n%s\n%s\n", a[i],b[x],">" > "output"f }f++;}}' f1 f2
test:
kent$ head f1 f2
==> f1 <==
a b
c d
e f
g h
==> f2 <==
A B
C D
E F
G H
I J
kent$ awk 'NR==FNR{a[NR]=$0;l=NR;next;} {b[FNR]=$0;}
END{f=1; for(x=1;x<=FNR;x++){for(i=1;i<=length(a);i++){printf "%s\n%s\n%s\n", a[i],b[x],">" > "output"f }f++;}}' f1 f2
kent$ head -30 out*
==> output1 <==
a b
A B
>
c d
A B
>
e f
A B
>
g h
A B
>
==> output2 <==
a b
C D
>
c d
C D
>
e f
C D
>
g h
C D
>
==> output3 <==
a b
E F
>
c d
E F
>
e f
E F
>
g h
E F
>
==> output4 <==
a b
G H
>
c d
G H
>
e f
G H
>
g h
G H
>
==> output5 <==
a b
I J
>
c d
I J
>
e f
I J
>
g h
I J
>
