Aggregating csv file in bash script - bash

I have a CSV file with multiple lines, each with the same number of columns. I need to group the lines by a few specified columns and aggregate the data in the other columns. Example input file:
proces1,pathA,5-May-2011,10-Sep-2017,5
proces2,pathB,6-Jun-2014,7-Jun-2015,2
proces1,pathB,6-Jun-2017,7-Jun-2017,1
proces1,pathA,11-Sep-2017,15-Oct-2017,2
For the above example I need to group lines by the first two columns. From the 3rd column I need the min value, from the 4th column the max value, and the 5th column should hold the sum. So for this input file I need the output:
proces1,pathA,5-May-2011,15-Oct-2017,7
proces1,pathB,6-Jun-2017,7-Jun-2017,1
proces2,pathB,6-Jun-2014,7-Jun-2015,2
I need to process it in bash (I can use awk or sed as well).

With bash and sort:
#!/bin/bash
# create associative arrays
declare -A month2num=([Jan]=1 [Feb]=2 [Mar]=3 [Apr]=4 [May]=5 [Jun]=6 [Jul]=7 [Aug]=8 [Sep]=9 [Oct]=10 [Nov]=11 [Dec]=12)
declare -A p ds de # date start and date end
declare -A -i sum # set integer attribute
# function to convert 5-Jun-2011 to 20110605
date2num() { local d m y; IFS="-" read -r d m y <<< "$1"; printf "%d%.2d%.2d\n" $y ${month2num[$m]} $d; }
# read all columns to variables p1 p2 d1 d2 s
while IFS="," read -r p1 p2 d1 d2 s; do
# if associative array is still empty for this entry
# fill with current strings/value
if [[ -z ${p[$p1,$p2]} ]]; then
p[$p1,$p2]="$p1,$p2"
ds[$p1,$p2]="$d1"
de[$p1,$p2]="$d2"
sum[$p1,$p2]="$s"
continue
fi
# compare strings, set new strings and sum value
if [[ ${p[$p1,$p2]} == "$p1,$p2" ]]; then
[[ $(date2num "$d1") < $(date2num ${ds[$p1,$p2]}) ]] && ds[$p1,$p2]="$d1"
[[ $(date2num "$d2") > $(date2num ${de[$p1,$p2]}) ]] && de[$p1,$p2]="$d2"
sum[$p1,$p2]=sum[$p1,$p2]+s
fi
done < file
# print content of all associative arrays with key from associative array p
for i in "${!p[@]}"; do echo "${p[$i]},${ds[$i]},${de[$i]},${sum[$i]}"; done
Usage: ./script.sh | sort
Output to stdout:
proces1,pathA,5-May-2011,15-Oct-2017,7
proces1,pathB,6-Jun-2017,7-Jun-2017,1
proces2,pathB,6-Jun-2014,7-Jun-2015,2
See: help declare, help read and of course man bash
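The date comparison in the script above works because each date is first rewritten as a fixed-width YYYYMMDD number, which sorts the same way as the date itself. A minimal sketch of that conversion in POSIX shell follows; the name `date_key` is hypothetical and not part of the answer, and it additionally strips a leading zero from the day, on the assumption that input such as `08-Sep-2017` might occur.

```shell
# Hypothetical helper: turn D-Mon-YYYY into a sortable YYYYMMDD key.
# The function body runs in a subshell so its variables stay private.
date_key() (
  d=${1%%-*}; rest=${1#*-}
  m=${rest%%-*}; y=${rest#*-}
  case $m in
    Jan) n=1 ;; Feb) n=2 ;; Mar) n=3 ;; Apr) n=4 ;; May) n=5 ;; Jun) n=6 ;;
    Jul) n=7 ;; Aug) n=8 ;; Sep) n=9 ;; Oct) n=10 ;; Nov) n=11 ;; Dec) n=12 ;;
    *) return 1 ;;   # unknown month abbreviation
  esac
  # Strip one leading zero so printf does not read "08" as an invalid octal number.
  d=${d#0}
  printf '%04d%02d%02d\n' "$y" "$n" "$d"
)

date_key 5-May-2011    # prints 20110505
date_key 15-Oct-2017   # prints 20171015
```

Because every key has exactly eight digits, comparing two keys as strings or as integers gives the same result.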

With awk + sort
awk -F',|-' '
BEGIN{
A["Jan"]="01"
A["Feb"]="02"
A["Mar"]="03"
A["Apr"]="04"
A["May"]="05"
A["Jun"]="06"
A["Jul"]="07"
A["Aug"]="08"
A["Sep"]="09"
A["Oct"]="10"
A["Nov"]="11"
A["Dec"]="12"
}
{
B[$1","$2]=B[$1","$2]+$9
z=sprintf( "%.2d",$3)
y=sprintf("%s",$5 A[$4] z)
if(!start[$1$2])
{
end[$1$2]=0
start[$1$2]=99999999
}
if (y < start[$1$2])
{
start[$1$2]=y
C[$1","$2]=$3"-"$4"-"$5
}
x=sprintf( "%.2d",$6)
w=sprintf("%s",$8 A[$7] x)
if(w > end[$1$2] )
{
end[$1$2]=w
D[$1","$2]=$6"-"$7"-"$8
}
}
END{
for (i in B)print i "," C[i] "," D[i] "," B[i]
}
' infile | sort

Extended GNU awk solution:
awk -F, 'function parse_date(d_str){
split(d_str, d, "-");
t = mktime(sprintf("%d %d %d 00 00 00", d[3], m[d[2]], d[1]));
return t
}
BEGIN{ m["Jan"]=1; m["Feb"]=2; m["Mar"]=3; m["Apr"]=4; m["May"]=5; m["Jun"]=6;
m["Jul"]=7; m["Aug"]=8; m["Sep"]=9; m["Oct"]=10; m["Nov"]=11; m["Dec"]=12;
}
{
k=$1 SUBSEP $2;
if (k in a){
if (parse_date(a[k]["min"]) > parse_date($3)) { a[k]["min"]=$3 }
if (parse_date(a[k]["max"]) < parse_date($4)) { a[k]["max"]=$4 }
} else {
a[k]["min"]=$3; a[k]["max"]=$4
}
a[k]["sum"]+= $5
}
END{
for (i in a) {
split(i, j, SUBSEP);
print j[1], j[2], a[i]["min"], a[i]["max"], a[i]["sum"]
}
}' OFS=',' file
The output:
proces1,pathA,5-May-2011,15-Oct-2017,7
proces1,pathB,6-Jun-2017,7-Jun-2017,1
proces2,pathB,6-Jun-2014,7-Jun-2015,2
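The gawk solution above relies on arrays of arrays and `mktime`, which are GNU extensions. As a rough sketch of the same grouping logic in portable awk, the function below keeps one plain array per aggregate and compares dates through a numeric YYYYMMDD key. The names `aggregate_csv`, `key`, `lo` and `hi` are illustrative assumptions, not taken from any of the answers.

```shell
# Hypothetical wrapper: group by columns 1-2, take min of col 3, max of col 4, sum of col 5.
aggregate_csv() {
  awk -F, '
    BEGIN {
      OFS = ","
      n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", names, " ")
      for (i = 1; i <= n; i++) month[names[i]] = i
    }
    # Convert D-Mon-YYYY to a number such as 20110505.
    function key(s,   p) {
      split(s, p, "-")
      return p[3] * 10000 + month[p[2]] * 100 + p[1]
    }
    {
      g = $1 OFS $2
      # The first row of a group initialises both dates; later rows only tighten them.
      if (!(g in sum) || key($3) < key(lo[g])) lo[g] = $3
      if (!(g in sum) || key($4) > key(hi[g])) hi[g] = $4
      sum[g] += $5
    }
    END { for (g in sum) print g, lo[g], hi[g], sum[g] }
  ' "$@" | sort
}
```

Called as `aggregate_csv file`, it should print the same three lines as the output shown above; the trailing `sort` is only there because awk's `for (g in sum)` visits keys in an unspecified order.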

Related

How to multiply columns in Bash?

I have the following three column data (first row is header) in a csv format
Value  Y  X
A      8  2
B      3  5
C      7  9
I want the following output also in a csv format
Value  Y*X
AB     40
BA     6
AC     72
CA     14
BC     27
CB     35
Is there a way to accomplish this in bash?
thank you
Here is the csv file copy-paste
VALUE,Y,X
A,13,7
C,0,0
D,3,25
E,2,44
F,0,6
H,1,1
I,5,3
K,45,3
L,1,31
M,2,3
N,3,3
P,113,87
Q,13,11
R,20,5
S,7,9
T,9,4
V,7,3
Y,1,0
I tried awk '{print $2*$3}' TEST.dat but the problem is it is not combinatorial i.e. not multiplying every column 2 value to every column 3 value.
Using awk:
awk -F, 'BEGIN { print "VALUE,Y*X"; i=0 } # Print header
FNR == 1 { next } # Skip existing header lines
FNR == NR { x[++i]=$3; values[i]=$1; next } # First pass through the file
{ # Second pass; multiply current row against every saved row
for (n = 1; n <= i; n++)
if (values[n] != $1) # Except itself
printf "%s%s,%d\n", $1, values[n], $2 * x[n]
}' input.csv input.csv
Process the file twice; first time saving the x values, and second time multiplying the current line's y against all the saved x's.
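When the input arrives on a pipe and cannot be read twice, the same cross product can be built in one pass by saving both columns and multiplying in the END block. This is a sketch of that variation rather than the answer's own method; `cross_multiply` is a hypothetical name, and rows are paired by position instead of by comparing their labels.

```shell
# Hypothetical single-pass variant: remember every row, then pair each Y with every other row's X.
cross_multiply() {
  awk -F, '
    NR == 1 { print "VALUE,Y*X"; next }          # replace the original header
    { n++; name[n] = $1; y[n] = $2; x[n] = $3 }  # remember every data row
    END {
      for (a = 1; a <= n; a++)
        for (b = 1; b <= n; b++)
          if (a != b)                            # skip pairing a row with itself
            printf "%s%s,%d\n", name[a], name[b], y[a] * x[b]
    }
  ' "$@"
}
```

For the three-row example from the question (A,8,2 / B,3,5 / C,7,9) this yields AB,40, AC,72, BA,6, BC,27, CA,14 and CB,35, which are the same products the question asks for, listed row by row.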
For fun, a version that uses sqlite, importing the CSV file and then doing a self-join:
sqlite3 -batch -header -csv <<EOF
.import input.csv data
SELECT a.value || b.value AS "VALUE", a.y * b.x AS "Y*X"
FROM data AS a
JOIN data AS b ON a.value <> b.value
ORDER BY a.rowid, b.rowid;
EOF
And pure bash:
#!/usr/bin/env bash
declare -a values yvalues xvalues
exec 3<input.csv
read -r -u 3 _ # Read and discard header
declare -i i=0
while IFS=, read -r -u 3 value y x; do
i+=1
values[i]=$value
yvalues[i]=$y
xvalues[i]=$x
done
echo "VALUE,Y*X"
for ((a=1; a<=i; a++)); do
for ((b=1; b<=i; b++)); do
if [[ $a -ne $b ]]; then
printf "%s%s,%d\n" "${values[a]}" "${values[b]}" \
"$(( yvalues[a] * xvalues[b] ))"
fi
done
done

I have a protein sequence file and I want to count trimers in it using sed or grep

I have a protein sequence file in the following format
uniprotID\space\sequence
sequence is a string of any length but with only 20 allowed letters i.e.
ARNDCQEGHILKMFPSTWYV
Example of 1 record
Q5768D AKCCACAKCCAC
I want to create a csv file in the following format
Q5768D
12
ACA 1
AKC 2
CAC 2
CAK 1
CCA 2
KCC 2
This is what I'm currently trying:
#!/bin/sh
while read ID SEQ # uniprot along with sequences
do
echo $SEQ | tr -d '[[:space:]]' | sed 's/./& /g' > TEST_FILE
declare -a SSA=(`cat TEST_FILE`)
SQL=$(echo ${#SSA[@]})
for (( X=0; X <= "$SQL"; X++ ))
do
Y=$(expr $X + 1)
Z=$(expr $X + 2)
echo ${SSA[X]} ${SSA[Y]} ${SSA[Z]}
done | awk '{if (NF == 3) print}' | tr -d ' ' > TEMPTRIMER
rm TEST_FILE # removing temporary sequence file
sort TEMPTRIMER|uniq -c > $ID.$SQL
done < $1
In this code I am storing each record in a separate file, which is not good. The program is also very slow: in 12 hours only 12,000 of the 0.5 million records were processed.
If this is what you want:
$ cat file
Q5768D AKCCACAKCCAC
OTHER FOOBARFOOBAR
$
$ awk -f tst.awk file
Q5768D OTHER
12 12
AKC 2 FOO 2
KCC 2 OOB 2
CCA 2 OBA 2
CAC 2 BAR 2
ACA 1 ARF 1
CAK 1 RFO 1
This will do it:
$ cat tst.awk
BEGIN { OFS="\t" }
{
colNr = NR
rowNr = 0
name[colNr] = $1
lgth[colNr] = length($2)
delete name2nr
for (i=1;i<=(length($2)-2);i++) {
trimer = substr($2,i,3)
if ( !(trimer in name2nr) ) {
name2nr[trimer] = ++rowNr
nr2name[colNr,rowNr] = trimer
}
cnt[colNr,name2nr[trimer]]++
}
numCols = colNr
numRows = (rowNr > numRows ? rowNr : numRows)
}
END {
for (colNr=1;colNr<=numCols;colNr++) {
printf "%s%s", name[colNr], (colNr<numCols?OFS:ORS)
}
for (colNr=1;colNr<=numCols;colNr++) {
printf "%s%s", lgth[colNr], (colNr<numCols?OFS:ORS)
}
for (rowNr=1;rowNr<=numRows;rowNr++) {
for (colNr=1;colNr<=numCols;colNr++) {
printf "%s %s%s", nr2name[colNr,rowNr], cnt[colNr,rowNr], (colNr<numCols?OFS:ORS)
}
}
}
If instead you want output like in @rogerovo's perl answer that'd be much simpler than the above and more efficient and use far less memory:
$ cat tst2.awk
{
delete cnt
for (i=1;i<=(length($2)-2);i++) {
cnt[substr($2,i,3)]++
}
printf "%s;%s", $1, length($2)
for (trimer in cnt) {
printf ";%s=%s", trimer, cnt[trimer]
}
print ""
}
$ awk -f tst2.awk file
Q5768D;12;ACA=1;KCC=2;CAK=1;CAC=2;CCA=2;AKC=2
OTHER;12;RFO=1;FOO=2;OBA=2;OOB=2;ARF=1;BAR=2
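Both awk scripts above are built on the same idea: slide a three-character window over the sequence with `substr`. Since `for (trimer in cnt)` visits keys in an unspecified order, a variant with deterministic output can hand the counting to `sort | uniq -c`. The sketch below assumes one "ID sequence" record per line and uses the hypothetical name `trimer_counts`; it prints one "ID trimer count" line per distinct trimer, which is a different layout from the one requested in the question.

```shell
# Hypothetical helper: emit every trimer with its record ID, then let sort/uniq count them.
trimer_counts() {
  awk '{
    # Slide a 3-character window over the sequence in field 2.
    for (i = 1; i <= length($2) - 2; i++)
      print $1, substr($2, i, 3)
  }' "$@" | sort | uniq -c | awk '{ print $2, $3, $1 }'
}

printf 'Q5768D AKCCACAKCCAC\n' | trimer_counts
```

For the sample record this prints six lines, from `Q5768D ACA 1` to `Q5768D KCC 2`, with the same counts as in the question. The whole file goes through one pipeline, so no per-record temporary files or subprocesses are needed.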
This Perl script processes about 550,000 trimers per second (random valid test sequences 0-8000 characters long; 100k records (~400 MB) produce a 2 GB output CSV).
output:
Q1024A;421;AAF=1;AAK=1;AFC=1;AFE=2;AGP=1;AHC=1;AHE=1;AIV=1;AKN=1;AMC=1;AQD=1;AQY=1;...
Q1074F;6753;AAA=1;AAD=1;AAE=1;AAF=2;AAN=2;AAP=2;AAT=1;ACA=1;ACC=1;ACD=1;ACE=3;ACF=2;...
code:
#!/usr/bin/perl
use strict;
$|=1;
my $c;
# process each line on input
while (readline STDIN) {
$c++; chomp;
# is it a valid line? has the format and a sequence to process
if (m~^(\w+)\s+([ARNDCQEGHILKMFPSTWYV]+)\r?$~ and $2) {
print join ";",($1,length($2));
my %trimdb;
my $seq=$2;
#split the sequence into chars
my @a=split //,$seq;
my @trimmer;
# while there are unprocessed chars in the sequence...
while (scalar @a) {
# fill up the buffer with a char from the top of the sequence
push @trimmer, shift @a;
# if the buffer is full (has 3 chars), increase the trimer frequency
if (scalar @trimmer == 3 ) {
$trimdb{(join "",@trimmer)}++;
# drop the first letter from buffer, for next loop
shift @trimmer;
}
}
# we're done with the sequence - print the sorted list of trimers
foreach (sort keys %trimdb) {
#print in a csv (;) line
print ";$_=$trimdb{$_}";
}
print"\n";
}
else {
#the input line was not valid.
print STDERR "input error: $_\n";
}
# just a progress counter
printf STDERR "%8i\r",$c if not $c%100;
}
print STDERR "\n";
if you have perl installed (most linuxes do, check the path /usr/bin/perl or replace with yours), just run: ./count_trimers.pl < your_input_file.txt > output.csv

How to use awk or anything else to count the number of shared x values of 2 different y values in a csv file consisting of columns a and b?

Let me be specific. We have a csv file consisting of 2 columns, x and y, like this:
x,y
1h,a2
2e,a2
4f,a2
7v,a2
1h,b6
4f,b6
4f,c9
7v,c9
...
And we want to count how many shared x values two y values have, which means we want to get this:
y1,y2,share
a2,b6,2
a2,c9,2
b6,c9,1
And b6,a2,2 should not show up. Does anyone know how to do this by awk? Or anything else?
Thx ahead!
Try this executable awk script:
#!/usr/bin/awk -f
BEGIN {FS=OFS=","}
NR==1 { print "y1" OFS "y2" OFS "share" }
NR>1 {last=a[$1]; a[$1]=(last!=""?last",":"")$2}
END {
for(i in a) {
cnt = split(a[i], arr, FS)
if( cnt>1 ) {
for(k=1;k<cnt;k++) {
for(m=k+1;m<=cnt;m++) {
if( arr[k] != arr[m] ) {
key=arr[k] OFS arr[m]
if(out[key]=="") {order[++ocnt]=key}
out[key]++
}
}
}
}
}
for(i=1;i<=ocnt;i++) {
print order[i] OFS out[order[i]]
}
}
When put into a file called awko and made executable, running it like awko data yields:
y1,y2,share
a2,b6,2
a2,c9,2
b6,c9,1
I'm assuming the file is sorted by y values in the second column as in the question( after the header ). If it works for you, I'll add some explanations tomorrow.
Additionally for anyone who wants more test data, here's a silly executable awk script for generating some data similar to what's in the question. Makes about 10K lines when run like gen.awk.
#!/usr/bin/awk -f
function randInt(max) {
return( int(rand()*max)+1 )
}
BEGIN {
a[1]="a"; a[2]="b"; a[3]="c"; a[4]="d"; a[5]="e"; a[6]="f"
a[7]="g"; a[8]="h"; a[9]="i"; a[10]="j"; a[11]="k"; a[12]="l"
a[13]="m"; a[14]="n"; a[15]="o"; a[16]="p"; a[17]="q"; a[18]="r"
a[19]="s"; a[20]="t"; a[21]="u"; a[22]="v"; a[23]="w"; a[24]="x"
a[25]="y"; a[26]="z"
print "x,y"
for(i=1;i<=26;i++) {
amultiplier = randInt(1000) # vary this to change the output size
r = randInt(amultiplier)
anum = 1
for(j=1;j<=amultiplier;j++) {
if( j == r ) { anum++; r = randInt(amultiplier) }
print a[randInt(26)] randInt(5) "," a[i] anum
}
}
}
I think if you can get the input into a form like this, it's easy:
1h a2 b6
2e a2
4f a2 b6 c9
7v a2 c9
In fact, you don't even need the x value. You can convert this:
a2 b6
a2
a2 b6 c9
a2 c9
Into this:
a2,b6
a2,b6
a2,c9
a2,c9
That output can be sorted and piped to uniq -c to get approximately the output you want, so we only need to think much about how to get from your input to the first and second states. Once we have those, the final step is easy.
Step one:
sort /tmp/values.csv \
| awk '
BEGIN { FS="," }
{
if (x != $1) {
if (x) print values
x = $1
values = $2
} else {
values = values " " $2
}
}
END { print values }
'
Step two:
| awk '
{
for (i = 1; i < NF; ++i) {
for (j = i+1; j <= NF; ++j) {
print $i "," $j
}
}
}
'
Step three:
| sort | awk '
BEGIN {
combination = $0
print "y1,y2,share"
}
{
if (combination == $0) {
count = count + 1
} else {
if (count) print combination "," count
count = 1
combination = $0
}
}
END { print combination "," count }
'
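As mentioned earlier in this answer, the counting in step three can also be approximated with `sort | uniq -c`. A minimal sketch, assuming the input is the list of `y1,y2` pairs produced by step two and that the pairs contain no spaces; `count_pairs` is a hypothetical name.

```shell
# Hypothetical replacement for step three: count identical "y1,y2" lines.
count_pairs() {
  echo "y1,y2,share"
  # uniq -c prefixes each distinct line with its count; awk moves the count to the end.
  sort "$@" | uniq -c | awk '{ print $2 "," $1 }'
}
```

It would be used as the last stage of the pipeline, for example `... | count_pairs`, in place of the final `sort | awk` stage.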
This awk script does the job:
BEGIN { FS=OFS="," }
NR==1 { print "y1","y2","share" }
NR>1 { ++seen[$1,$2]; ++x[$1]; ++y[$2] }
END {
for (y1 in y) {
for (y2 in y) {
if (y1 != y2 && !(y2 SUBSEP y1 in c)) {
for (i in x) {
if (seen[i,y1] && seen[i,y2]) {
++c[y1,y2]
}
}
}
}
}
for (key in c) {
split(key, a, SUBSEP)
print a[1],a[2],c[key]
}
}
Loop through the input, recording both the original elements and the combinations. Once the file has been processed, look at each pair of y values. The if statement does two things: it prevents equal y values from being compared and it saves looping through the x values twice for every pair. Shared values are stored in c.
Once the shared values have been aggregated, the final output is printed.
This bash script, which calls sed to pick out individual lines, does the trick:
#!/bin/bash
echo y1,y2,share
x=$(wc -l < file)
b=$(echo "$x -2" | bc)
index=0
for i in $(eval echo "{2..$b}")
do
var_x_1=$(sed -n ''"$i"p'' file | sed 's/,.*//')
var_y_1=$(sed -n ''"$i"p'' file | sed 's/.*,//')
a=$(echo "$i + 1" | bc)
for j in $(eval echo "{$a..$x}")
do
var_x_2=$(sed -n ''"$j"p'' file | sed 's/,.*//')
var_y_2=$(sed -n ''"$j"p'' file | sed 's/.*,//')
if [ "$var_x_1" = "$var_x_2" ] ; then
array[$index]=$var_y_1,$var_y_2
index=$(echo "$index + 1" | bc)
fi
done
done
counter=1
for (( k=1; k<$index; k++ ))
do
if [ ${array[k]} = ${array[k-1]} ] ; then
counter=$(echo "$counter + 1" | bc)
else
echo ${array[k-1]},$counter
counter=1
fi
if [ "$k" = $(echo "$index-1"|bc) ] && [ $counter = 1 ]; then
echo ${array[k]},$counter
fi
done

Reduced permutations

Consider the following string
abcd
I can return 2 character permutations
(cartesian product)
like this
$ echo {a,b,c,d}{a,b,c,d}
aa ab ac ad ba bb bc bd ca cb cc cd da db dc dd
However I would like to remove redundant entries such as
ba ca cb da db dc
and invalid entries
aa bb cc dd
so I am left with
ab ac ad bc bd cd
Here's a pure bash one:
#!/bin/bash
pool=( {a..d} )
for((i=0;i<${#pool[@]}-1;++i)); do
for((j=i+1;j<${#pool[@]};++j)); do
printf '%s\n' "${pool[i]}${pool[j]}"
done
done
and another one:
#!/bin/bash
pool=( {a..d} )
while ((${#pool[@]}>1)); do
h=${pool[0]}
pool=("${pool[@]:1}")
printf '%s\n' "${pool[@]/#/$h}"
done
They can be written as functions (or scripts):
get_perms_ordered() {
local i j
for((i=1;i<"$#";++i)); do
for((j=i+1;j<="$#";++j)); do
printf '%s\n' "${!i}${!j}"
done
done
}
or
get_perms_ordered() {
local h
while (("$#">1)); do
h=$1; shift
printf '%s\n' "${@/#/$h}"
done
}
Use as:
$ get_perms_ordered {a..d}
ab
ac
ad
bc
bd
cd
This last one can easily be transformed into a recursive function to obtain ordered permutations of a given length (without replacement—I'm using the silly ball-urn probability vocabulary), e.g.,
get_withdraws_without_replacement() {
# $1=number of balls to withdraw
# $2,... are the ball "colors"
# return is in array gwwr_ret
local n=$1 h r=()
shift
((n>0)) || return
((n==1)) && { gwwr_ret=( "$@" ); return; }
while (("$#">=n)); do
h=$1; shift
get_withdraws_without_replacement "$((n-1))" "$@"
r+=( "${gwwr_ret[@]/#/$h}" )
done
gwwr_ret=( "${r[@]}" )
}
Then:
$ get_withdraws_without_replacement 3 {a..d}
$ echo "${gwwr_ret[@]}"
abc abd acd bcd
You can use awk to filter away the entries you don't want:
echo {a,b,c,d}{a,b,c,d} | awk -v FS="" -v RS=" " '$1 == $2 { next } ; $1 > $2 { SEEN[ $2$1 ] = 1 ; next } ; { SEEN[ $1$2 ] =1 } ; END { for ( I in SEEN ) { print I } }'
In details:
echo {a,b,c,d}{a,b,c,d} \
| awk -v FS="" -v RS=" " '
# Ignore identical values
$1 == $2 { next }
# Reorder and record inverted entries
$1 > $2 { SEEN[ $2$1 ] = 1 ; next }
# Record everything else
{ SEEN[ $1$2 ] = 1 }
# Print the final list
END { for ( I in SEEN ) { print I } }
'
FS="" tells awk that each character is a separate field.
RS=" " uses spaces to separate records.
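Note that splitting into single characters with `FS=""` is an extension that gawk and mawk provide; POSIX leaves the behaviour of an empty FS unspecified. A sketch that avoids it by indexing into the string with `substr`, and that produces only the wanted pairs so nothing has to be filtered afterwards, might look like this. The name `string_pairs` is hypothetical.

```shell
# Hypothetical helper: print every 2-character combination of the string, in order.
string_pairs() {
  awk -v s="$1" 'BEGIN {
    n = length(s)
    for (i = 1; i < n; i++)          # first character of the pair
      for (j = i + 1; j <= n; j++)   # second character, always further right
        print substr(s, i, 1) substr(s, j, 1)
  }'
}

string_pairs abcd   # prints ab, ac, ad, bc, bd, cd on separate lines
```

Since `awk -v` interprets backslash escapes in the assigned value, this is only suitable for strings without backslashes.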
I'm sure someone's going to do this in one line of awk, but here is something in bash:
#!/bin/bash
seen=":"
result=""
for i in "$@"
do
for j in "$@"
do
if [ "$i" != "$j" ]
then
if [[ $seen != *":$j$i:"* ]]
then
result="$result $i$j"
seen="$seen$i$j:"
fi
fi
done
done
echo $result
Output:
$ ./prod.sh a b c d
ab ac ad bc bd cd
$ ./prod.sh I have no life
Ihave Ino Ilife haveno havelife nolife
here is a pseudo code to achieve that, based on your restrictions, and
using an array for your characters:
for (i=0;i<array.length;i++)
{
for (j=i+1;j<array.length;j++)
{
print array[i] + array[j]; //concatenation
}
}
I realized that I am not looking for permutations, but the power set. Here
is an implementation in Awk:
{
for (c = 0; c < 2 ^ NF; c++) {
e = 0
for (d = 0; d < NF; d++)
if (int(c / 2 ^ d) % 2) {
printf "%s", $(d + 1)
}
print ""
}
}
Input:
a b c d
Output:
a
b
ab
c
ac
bc
abc
d
ad
bd
abd
cd
acd
bcd
abcd
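The expression `int(c / 2 ^ d) % 2` in the awk script tests whether bit d of the counter c is set, so each value of c selects one subset. The same idea can be sketched in plain shell arithmetic with shifts and masks; `power_set` is a hypothetical name, and this version starts at 1 so that the empty subset is skipped.

```shell
# Hypothetical helper: print every non-empty subset of the arguments, one per line.
# The function body runs in a subshell so its variables stay private.
power_set() (
  n=$#
  c=1
  while [ "$c" -lt $((1 << n)) ]; do
    subset=
    d=0
    for item in "$@"; do
      # Include the d-th argument when bit d of the counter c is set.
      if [ $(( (c >> d) & 1 )) -eq 1 ]; then subset=$subset$item; fi
      d=$((d + 1))
    done
    printf '%s\n' "$subset"
    c=$((c + 1))
  done
)

power_set a b c   # prints a, b, ab, c, ac, bc, abc on separate lines
```

With n arguments the loop runs 2^n - 1 times, so this is only practical for small sets.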

Uniq in awk; removing duplicate values in a column using awk

I have a large datafile in the following format below:
ENST00000371026 WDR78,WDR78,WDR78, WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32 WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458, atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,
The columns are tab separated. Multiple values within columns are comma separated. I would like to remove the duplicate values in the second column to result in something like this:
ENST00000371026 WDR78 WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32 WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458 atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,
I tried the following code below but it doesn't seem to remove the duplicate values.
awk '
BEGIN { FS="\t" } ;
{
split($2, valueArray,",");
j=0;
for (i in valueArray)
{
if (!( valueArray[i] in duplicateArray))
{
duplicateArray[j] = valueArray[i];
j++;
}
};
printf $1 "\t";
for (j in duplicateArray)
{
if (duplicateArray[j]) {
printf duplicateArray[j] ",";
}
}
printf "\t";
print $3
}' knownGeneFromUCSC.txt
How can I remove the duplicates in column 2 correctly?
Your script acts only on the second record (line) in the file because of NR==2. I took it out, but it may be what you intend. If so, you should put it back.
The in operator checks for the presence of the index, not the value, so I made duplicateArray an associative array* that uses the values from valueArray as its indices. This saves from having to iterate over both arrays in a loop within a loop.
The split statement sees "WDR78,WDR78,WDR78," as four fields rather than three so I added an if to keep it from printing a null value which would result in ",WDR78," being printed if the if weren't there.
* In reality all arrays in AWK are associative.
awk '
BEGIN { FS="\t" } ;
{
split($2, valueArray,",");
j=0;
for (i in valueArray)
{
if (!(valueArray[i] in duplicateArray))
{
duplicateArray[valueArray[i]] = 1
}
};
printf $1 "\t";
for (j in duplicateArray)
{
if (j) # prevents printing an extra comma
{
printf j ",";
}
}
printf "\t";
print $3
delete duplicateArray # for non-gawk, use split("", duplicateArray)
}'
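Because `for (j in duplicateArray)` visits indices in an unspecified order, the script above may print the unique values in a different order than they appeared. A small sketch of the same membership test, with the value used as the index, that also keeps first-seen order is shown below on a single comma-separated list. The names `dedupe_list`, `item` and `out` are illustrative assumptions.

```shell
# Hypothetical helper: remove repeated and empty entries from a comma-separated list.
dedupe_list() {
  awk -v list="$1" 'BEGIN {
    n = split(list, item, ",")
    out = ""
    for (i = 1; i <= n; i++) {
      # "in" tests array indices, so each value is stored as an index of seen.
      if (item[i] == "" || (item[i] in seen)) continue
      seen[item[i]] = 1
      out = (out == "") ? item[i] : (out "," item[i])
    }
    print out
  }'
}

dedupe_list 'WDR78,WDR78,WDR78,'   # prints WDR78
dedupe_list 'RERE,KIAA0458,'       # prints RERE,KIAA0458
```

Walking the result of `split` by numeric index, rather than with `for (i in item)`, is what preserves the original order.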
Perl:
perl -F'\t' -lane'
$F[1] = join ",", grep !$_{$_}++, split ",", $F[1];
print join "\t", @F; %_ = ();
' infile
awk:
awk -F'\t' '{
n = split($2, t, ","); _2 = x
split(x, _) # use delete _ if supported
for (i = 0; ++i <= n;)
_[t[i]]++ || _2 = _2 ? _2 "," t[i] : t[i]
$2 = _2
}-3' OFS='\t' infile
The line 4 in the awk script is used to preserve the original order of the values in the second field after filtering the unique values.
Sorry, I know you asked about awk... but Perl makes this much more simple:
$ perl -n -e ' @t = split(/\t/);
%t2 = map { $_ => 1 } split(/,/,$t[1]);
$t[1] = join(",",keys %t2);
print join("\t",@t); ' knownGeneFromUCSC.txt
Pure Bash 4.0 (one associative array):
declare -a part # parts of a line
declare -a part2 # parts 2. column
declare -A check # used to remember items in part2
while IFS=$'\t' read -r -a part ; do # read one line and split it on tabs only
IFS=',' # separator is comma
part2=( ${part[1]} ) # split 2. column using comma
if [ ${#part2[@]} -gt 1 ] ; then # more than 1 field in 2. column?
check=() # empty check array
new2='' # empty new 2. column
for item in "${part2[@]}" ; do
(( check[$item]++ )) # remember items in 2. column
if [ ${check[$item]} -eq 1 ] ; then # not yet seen?
new2=$new2,$item # add to new 2. column
fi
done
part[1]=${new2#,} # remove leading comma
fi
IFS=$'\t' # separator for the output
echo "${part[*]}" # rebuild line
done < "$infile"

Resources