Replacing numbers with their respective strings in awk - bash

I am a newbie in bash/awk programming and I have a file that looks like this:
1 10032154 10032154 A C Leber_congenital_amaurosis_9 criteria_provided,_single_submitter Benign . 1
1 10032184 10032184 A G Retinal_dystrophy|Leber_congenital_amaurosis_9|not_provided criteria_provided,_multiple_submitters,_no_conflicts Pathogenic/Likely_pathogenic . 1,4
1 10032209 10032209 G A not_provided criteria_provided,_single_submitter Likely_benign . 8,64,512
With awk, I want to replace the numbers in the last column ($10) with their descriptions. I assigned the numbers and their definitions to two different arrays. My idea was to replace these numbers by iterating over the two arrays together. Here, 0 is "unknown", 1 is "germline", 4 is "somatic" and so on.
z=(0 1 2 4 8 16 32 64 128 256 512 1024 1073741824)
t=("unknown" "germline" "somatic" "inherited" "paternal" "maternal" "de-novo" "biparental" "uniparental" "not-tested" "tested-inconclusive" "not-reported" "other")
number=$(IFS=,; echo "${z[*]}")
def=$(IFS=,; echo "${t[*]}")
awk -v a="$number" -v b="${def}" 'BEGIN { OFS="\t" } /#/ {next}
{
x=split(a, e, /,/)
y=split(b, f, /,/)
delete c
m=split($10, c, /,/)
for (i=1; i<=m; i++) {
for (j=1; j<=x; j++) {
if (c[i]==e[j]) {
c[i]=f[j]
}
}
$10+=sprintf("%s, ",c[i])
}
print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10
}' input.vcf > output.vcf
The output should look like this:
1 10032154 10032154 A C Leber_congenital_amaurosis_9 criteria_provided,_single_submitter Benign . germline
1 10032184 10032184 A G Retinal_dystrophy|Leber_congenital_amaurosis_9|not_provided criteria_provided,_multiple_submitters,_no_conflicts Pathogenic/Likely_pathogenic . germline,paternal
1 10032209 10032209 G A not_provided criteria_provided,_single_submitter Likely_benign . paternal,biparental,tested-inconclusive
I would be so glad if you could help me!
All the best

Assuming you don't really need to define the lists of numbers and names as 2 shell arrays for some other reason:
$ cat tst.awk
BEGIN {
split("0 1 2 4 8 16 32 64 128 256 512 1024 1073741824",nrsArr)
split("unknown germline somatic inherited paternal maternal de-novo biparental uniparental not-tested tested-inconclusive not-reported other",namesArr)
for (i in nrsArr) {
nr2name[nrsArr[i]] = namesArr[i]
}
}
!/#/ {
n = split($NF,nrs,/,/)
sub(/[^[:space:]]+$/,"")
printf "%s", $0
for (i=1; i<=n; i++) {
printf "%s%s", nr2name[nrs[i]], (i<n ? "," : ORS)
}
}
$ awk -f tst.awk input.vcf
1 10032154 10032154 A C Leber_congenital_amaurosis_9 criteria_provided,_single_submitter Benign . germline
1 10032184 10032184 A G Retinal_dystrophy|Leber_congenital_amaurosis_9|not_provided criteria_provided,_multiple_submitters,_no_conflicts Pathogenic/Likely_pathogenic . germline,inherited
1 10032209 10032209 G A not_provided criteria_provided,_single_submitter Likely_benign . paternal,biparental,tested-inconclusive
The above preserves whatever white space you had in your input file in case that matters.

You may use this awk:
z=(0 1 2 4 8 16 32 64 128 256 512 1024 1073741824)
t=("unknown" "germline" "somatic" "inherited" "paternal" "maternal" "de-novo" "biparental" "uniparental" "not-tested" "tested-inconclusive" "not-reported" "other")
awk -v z="${z[*]}" -v t="${t[*]}" '
BEGIN {
split(z, zarr)
split(t, tarr)
for (i=1; i in zarr; ++i)
map[zarr[i]] = tarr[i]
}
{
split($NF, arr, /,/)
s = ""
for (i=1; i in arr; ++i)
s = s (i == 1 ? "" : ",") map[arr[i]]
$NF = s;
}
1
' file
btw, number 4 is mapped to inherited, not paternal as you have in your expected output.

Use this short Perl in-line script:
perl -F'\t' -lane '
BEGIN {
@keys = qw( 0 1 2 4 8 16 32 64 128 256 512 1024 1073741824 );
@vals = qw( unknown germline somatic inherited paternal maternal de-novo biparental uniparental not-tested tested-inconclusive not-reported other );
%val = map { $keys[$_] => $vals[$_] } 0..$#keys;
}
print join "\t", @F[0..8], ( join ",", map { $val{$_} } split /,/, $F[9] );
' in_file > out_file
The Perl script uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in the -F option.
-F'\t' : Split into @F on TAB, rather than on whitespace.
%val = map { $keys[$_] => $vals[$_] } 0..$#keys; : Create %val - a hash lookup table with keys = numeric codes and values = mutation/variant types.
Note that in Perl, arrays are 0-indexed.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches

Assumptions:
OP has confirmed beforehand that the z and t arrays are valid (eg, same number of elements in both arrays)
OP may want to (dynamically) change the contents of the z and t arrays so we'll leave the array assignments at the bash level (ie, won't hardcode inside of awk)
the substitution strings could contain white space, so we'll keep OP's current method of building comma-delimited strings (from the z and t arrays); also assumes replacement strings do not contain commas; this should simplify parsing of the replacement strings within awk
while OP has explicitly coded for (awk) field #10, we'll assume this number could change; we'll focus on processing the last field in a row
Small change to initialization code:
# original arrays
z=(0 1 2 4 8 16 32 64 128 256 512 1024 1073741824)
t=("unknown" "germline" "somatic" "inherited" "paternal" "maternal" "de-novo" "biparental" "uniparental" "not-tested" "tested-inconclusive" "not-reported" "other")
# renamed variables (format: x,y,z,...)
nums=$(IFS=,; echo "${z[*]}")
alphas=$(IFS=,; echo "${t[*]}")
One awk idea:
awk -v nums="${nums}" -v alphas="${alphas}" ' # pass comma-delimited variables to awk
BEGIN { OFS="\t" # copied from original code
n=split(nums,num,/,/) # split comma-delimted variables
a=split(alphas,alpha,/,/) # into arrays
}
/#/ { next } # copied from original code
{ l=split($NF,lastf,/,/) # split the last (comma-delimited) field
$NF="" # clear the last field
pfx="" # initialize our prefix string
for (i=1; i<=l; i++) # loop through entries in the last field
for (j=1; j<=n; j++) # loop through array of numbers
if ( lastf[i] == num[j] ) # if array entries match ...
{ $NF= $NF pfx alpha[j] # append the associated alpha to the last field
pfx="," # set the prefix to "," for the next item
break # break out one level to process next entry in the last field
}
}
{ print } # print the current line (with modified last field)
' input.vcf
The above generates:
1 10032154 10032154 A C Leber_congenital_amaurosis_9 criteria_provided,_single_submitter Benign . germline
1 10032184 10032184 A G Retinal_dystrophy|Leber_congenital_amaurosis_9|not_provided criteria_provided,_multiple_submitters,_no_conflicts Pathogenic/Likely_pathogenic . germline,inherited
1 10032209 10032209 G A not_provided criteria_provided,_single_submitter Likely_benign . paternal,biparental,tested-inconclusive

Related

To split and arrange numbers in single inverted commas

I have around 65000 product codes in a text file. I want to split those numbers into groups of 999 each, and then wrap each of the 999 numbers in single quotes, separated by commas.
Could you please suggest how I can achieve the above with a Unix script?
87453454
65778445
.
.
.
.
down to 65000 product codes.
I need to arrange them in the pattern below:
'87453454','65778445',
With awk:
awk '
++c == 1 { out = "\047" $0 "\047"; next }
{ out = out ",\047" $0 "\047" }
c == 999 { print out; c = 0 }
END { if (c) print out }
' file
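As a quick sanity check at a smaller scale (a sketch: group size 3 instead of 999, with seq standing in for the product-codes file):
$ seq 1 7 | awk '
  ++c == 1 { out = "\047" $0 "\047"; next }
  { out = out ",\047" $0 "\047" }
  c == 3 { print out; c = 0 }
  END { if (c) print out }
  '
'1','2','3'
'4','5','6'
'7'
\047 is the octal escape for a single quote; it keeps the quoting readable inside the single-quoted awk program.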
Or, with GNU sed:
sed "
:a
\$bb
N
0~999{
:b
s/\n/','/g
s/^/'/
s/$/'/
b
}
ba" file
With Perl:
perl -ne '
sub pq { chomp; print "\x27$_\x27" } pq;
for (1 .. 998) {
if (defined($_ = <>)) {
print ",";
pq
}
}
print "\n"
' < file
Credit to Mauke of #perl on libera.chat
65000 isn't that many lines for awk - just do it all in one shot:
mawk 'BEGIN { FS = RS; RS = "^$"; OFS = (_="\47")(",")_
} gsub(/^|[^0-9]*$/,_, $!(NF = NF))'
'66771756','69562431','22026341','58085790','22563930',
'63801696','24044132','94255986','56451624','46154427'
That's for grouping them all in one line. To make 999 ones, try
jot -r 50 10000000 99999999 |
# change "5" to "999" here
rs -C= 0 5 |
mawk 'sub(".*", "\47&\47", $!(NF -= _==$NF ))' FS== OFS='\47,\47'
'36452530','29776340','31198057','36015730','30143632'
'49664844','83535994','86871984','44613227','12309645'
'58002568','31342035','72695499','54546650','21800933'
'38059391','36935562','98323086','91089765','65672096'
'17634208','14009291','39114390','35338398','43676356'
'14973124','19782405','96782582','27689803','27438921'
'79540212','49141859','25714405','42248622','25589123'
'11466085','87022819','65726165','86718075','56989625'
'12900115','82979216','65469187','63769703','86494457'
'26544666','89342693','64603075','26102683','70528492'
_==$NF checks whether the rightmost column is empty or not, i.e. whether there's a trailing edge separator that needs to be trimmed.
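For readers who prefer standard awk over the terse mawk form, here is a less compact sketch of the same "all in one line" grouping (seq stands in for real input):
$ seq 11111111 11111115 | awk '
  { q = q (NR > 1 ? "\47,\47" : "\47") $0 }   # \47 = single quote
  END { print q "\47" }'
'11111111','11111112','11111113','11111114','11111115'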
If your input file only contains short codes as shown in your example, you could use the following hack:
xargs -L 999 bash -c "printf \'%s\', \"\$@\"; echo" . <inputFile >outputFile
Alternatively, you can use this sed command:
sed -Ene"s/(.*)/'\1',/;H" -e{'0~999','$'}'{z;x;s/\n//g;p}' <inputFile >outputFile
s/(.*)/'\1',/ wraps each line in '...',
but does not print it (-n)
instead, H appends the modified line to the so-called hold space; basically a helper variable storing a single string.
(This also adds a line break as a separator, but we remove that later).
Every 999 lines (0~999) and at the end of the input file ($) ...
... the hold space is then printed and cleared (z;x;...;p)
while deleting all delimiter-linebreaks (s/\n//g) mentioned earlier.
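A small-scale check of the same command (groups of 3; requires GNU sed, since the 0~3 address and the z command are GNU extensions):
$ seq 1 7 | sed -Ene"s/(.*)/'\1',/;H" -e{'0~3','$'}'{z;x;s/\n//g;p}'
'1','2','3',
'4','5','6',
'7',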

I have a protein sequence file; I want to count trimers in it using sed or grep

I have a protein sequence file in the following format
uniprotID\space\sequence
sequence is a string of any length but with only 20 allowed letters i.e.
ARNDCQEGHILKMFPSTWYV
Example of 1 record
Q5768D AKCCACAKCCAC
I want to create a csv file in the following format
Q5768D
12
ACA 1
AKC 2
CAC 2
CAK 1
CCA 2
KCC 2
This is what I'm currently trying:
#!/bin/sh
while read ID SEQ # uniprot along with sequences
do
echo $SEQ | tr -d '[[:space:]]' | sed 's/./& /g' > TEST_FILE
declare -a SSA=(`cat TEST_FILE`)
SQL=$(echo ${#SSA[@]})
for (( X=0; X <= "$SQL"; X++ ))
do
Y=$(expr $X + 1)
Z=$(expr $X + 2)
echo ${SSA[X]} ${SSA[Y]} ${SSA[Z]}
done | awk '{if (NF == 3) print}' | tr -d ' ' > TEMPTRIMER
rm TEST_FILE # removing temporary sequence file
sort TEMPTRIMER|uniq -c > $ID.$SQL
done < $1
In this code I am storing each individual record in a separate file, which is not good. Also, the program is very slow: in 12 hours only 12000 records out of 0.5 million were processed.
If this is what you want:
$ cat file
Q5768D AKCCACAKCCAC
OTHER FOOBARFOOBAR
$
$ awk -f tst.awk file
Q5768D OTHER
12 12
AKC 2 FOO 2
KCC 2 OOB 2
CCA 2 OBA 2
CAC 2 BAR 2
ACA 1 ARF 1
CAK 1 RFO 1
This will do it:
$ cat tst.awk
BEGIN { OFS="\t" }
{
colNr = NR
rowNr = 0
name[colNr] = $1
lgth[colNr] = length($2)
delete name2nr
for (i=1;i<=(length($2)-2);i++) {
trimer = substr($2,i,3)
if ( !(trimer in name2nr) ) {
name2nr[trimer] = ++rowNr
nr2name[colNr,rowNr] = trimer
}
cnt[colNr,name2nr[trimer]]++
}
numCols = colNr
numRows = (rowNr > numRows ? rowNr : numRows)
}
END {
for (colNr=1;colNr<=numCols;colNr++) {
printf "%s%s", name[colNr], (colNr<numCols?OFS:ORS)
}
for (colNr=1;colNr<=numCols;colNr++) {
printf "%s%s", lgth[colNr], (colNr<numCols?OFS:ORS)
}
for (rowNr=1;rowNr<=numRows;rowNr++) {
for (colNr=1;colNr<=numCols;colNr++) {
printf "%s %s%s", nr2name[colNr,rowNr], cnt[colNr,rowNr], (colNr<numCols?OFS:ORS)
}
}
}
If instead you want output like in @rogerovo's perl answer, that'd be much simpler than the above, more efficient, and would use far less memory:
$ cat tst2.awk
{
delete cnt
for (i=1;i<=(length($2)-2);i++) {
cnt[substr($2,i,3)]++
}
printf "%s;%s", $1, length($2)
for (trimer in cnt) {
printf ";%s=%s", trimer, cnt[trimer]
}
print ""
}
$ awk -f tst2.awk file
Q5768D;12;ACA=1;KCC=2;CAK=1;CAC=2;CCA=2;AKC=2
OTHER;12;RFO=1;FOO=2;OBA=2;OOB=2;ARF=1;BAR=2
This Perl script processes circa 550,000 trimers/sec (random valid test sequences 0-8000 chars long; 100k records, ~400 MB, produce a 2 GB output csv).
output:
Q1024A;421;AAF=1;AAK=1;AFC=1;AFE=2;AGP=1;AHC=1;AHE=1;AIV=1;AKN=1;AMC=1;AQD=1;AQY=1;...
Q1074F;6753;AAA=1;AAD=1;AAE=1;AAF=2;AAN=2;AAP=2;AAT=1;ACA=1;ACC=1;ACD=1;ACE=3;ACF=2;...
code:
#!/usr/bin/perl
use strict;
$|=1;
my $c;
# process each line on input
while (readline STDIN) {
$c++; chomp;
# is it a valid line? has the format and a sequence to process
if (m~^(\w+)\s+([ARNDCQEGHILKMFPSTWYV]+)\r?$~ and $2) {
print join ";",($1,length($2));
my %trimdb;
my $seq=$2;
#split the sequence into chars
my @a=split //,$seq;
my @trimmer;
# while there are unprocessed chars in the sequence...
while (scalar @a) {
# fill up the buffer with a char from the top of the sequence
push @trimmer, shift @a;
# if the buffer is full (has 3 chars), increase the trimer frequency
if (scalar @trimmer == 3 ) {
$trimdb{(join "",@trimmer)}++;
# drop the first letter from buffer, for next loop
shift @trimmer;
}
}
# we're done with the sequence - print the sorted list of trimers
foreach (sort keys %trimdb) {
#print in a csv (;) line
print ";$_=$trimdb{$_}";
}
print"\n";
}
else {
#the input line was not valid.
print STDERR "input error: $_\n";
}
# just a progress counter
printf STDERR "%8i\r",$c if not $c%100;
}
print STDERR "\n";
If you have Perl installed (most Linux systems do; check the path /usr/bin/perl or replace it with yours), just run: ./count_trimers.pl < your_input_file.txt > output.csv

Using bc in awk

I am trying to use bc in an awk script. In the code below, I am trying to convert a hexadecimal number to binary and store it in a variable.
#!/bin/awk -f
{
binary_vector = $(bc <<< "ibase=16;obase=2;FF")
}
Where do I go wrong?
Not saying it's a good idea but:
$ awk 'BEGIN {
cmd = "bc <<< \"ibase=16;obase=2;FF\""
rslt = ((cmd | getline line) > 0 ? line : -1)
close(cmd)
print rslt
}'
11111111
Also see http://gnu.org/software/gawk/manual/gawk.html#Bitwise-Functions and http://gnu.org/software/gawk/manual/gawk.html#Nondecimal-Data
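With those gawk facilities you can skip bc entirely. A minimal sketch, assuming GNU awk (strtonum is a gawk extension):
$ gawk 'BEGIN {
    n = strtonum("0xFF")        # parse the hex string as a number
    while (n > 0) {             # build the binary string, least significant bit first
        bits = (n % 2) bits
        n = int(n / 2)
    }
    print (bits == "" ? "0" : bits)
}'
11111111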
The following one-liner Awk script should do what you want:
awk -vVAR=$(read -p "Enter number: " -u 0 num; echo $num) \
'BEGIN{system("echo \"ibase=16;obase=2;"VAR"\"|bc");}'
Explanation:
-vVAR Passes the variable VAR into Awk
-vVAR=$(read -p ... ) Sets the variable VAR from the
shell to the user input.
system("echo ... |bc") Uses the Awk system built in command to execute the shell commands. Notice how the quoting stops at the variable VAR and then continues just after it, thats so that Awk interprets VAR as an Awk variable and not as part of the string put into the system call.
Update - to use it in an Awk variable:
awk -vVAR=$(read -p "Enter number: " -u 0 num; echo $num) \
'BEGIN{s="echo \"ibase=16;obase=2;"VAR"\"|bc"; s | getline awk_var;\
close(s); print awk_var}'
s | getline awk_var will put the output of the command s into the Awk variable awk_var. Note the string is built before sending it to getline - if not (unless you parenthesize the string concatenation), Awk will try to send the pieces (the literal parts and VAR) to getline separately.
The close(s) closes the pipe - although for bc it doesn't matter and Awk automatically closes pipes upon exit - if you put this into a more elaborate Awk script it is best to explicitly close the pipe. According to the Awk documentation some commands such as mail will wait on the pipe to close prior to completion.
http://www.staff.science.uu.nl/~oostr102/docs/nawk/nawk_39.html
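Combining both points (build the command string first, check getline's return value, close the pipe), a minimal sketch; the variable name VAR is just illustrative:
$ awk -v VAR=FF 'BEGIN {
    cmd = "echo \"ibase=16;obase=2;" VAR "\" | bc"   # build the full command first
    if ((cmd | getline out) > 0) print out           # read one line of bc output
    close(cmd)                                       # close the pipe for reuse
}'
11111111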
By the way you wrote your example, it looks like you want to convert an awk record (line) into an associative array. Here's an awk executable script that allows that by running the bc command over values in a split-type array:
#!/usr/bin/awk -f
{
# initialize the a array
cnt = split($0, a, FS)
if( convertArrayBase(10, 2, a, cnt) > -1 ) {
# use the array here
for(i=1; i<=cnt; i++) {
print a[i]
}
}
}
# Destructively updates input array, converting numbers from ibase to obase
#
# @ibase: ibase value for bc
# @obase: obase value for bc
# @a: a split() type associative array where keys are numeric
# @cnt: size of a ( number of fields )
#
# @return: -1 if there's a getline error, else cnt
#
function convertArrayBase(ibase, obase, a, cnt, i, b, cmd) {
cmd = sprintf("echo \"ibase=%d;obase=%d", ibase, obase)
for(i=1; i<=cnt; i++ ) {
cmd = cmd ";" a[i]
}
cmd = cmd "\" | bc"
i = 0 # reset i
while( (cmd | getline b) > 0 ) {
a[++i] = b
}
close( cmd )
return i==cnt ? cnt : -1
}
When used with an input of:
1 2 3
4 s 1234567
this script outputs the following:
1
10
11
100
0
100101101011010000111
The convertArrayBase function operates on split-type arrays, so you have to initialize the input array (a here) with the full row (as shown) or a field's subfields (not shown) before calling it. It destructively updates the array.
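For instance, to run the conversion over just the comma-separated subfields of field 2 instead of the whole row (a sketch; the field number is illustrative):
{
    cnt = split($2, a, ",")                  # initialize a from $2's subfields
    if ( convertArrayBase(10, 2, a, cnt) > -1 ) {
        for (i=1; i<=cnt; i++) print a[i]
    }
}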
You could instead call bc directly with some helper files to get similar output. I didn't find that bc supports - (stdin as a file name), so it's a little more involved than I'd like.
Making a start_cmds file like this:
ibase=10;obase=2;
and a quit_cmd like:
;quit
Given an input file (called data.semi) where the data is separated by a ;, like this:
1;2;3
4;s;1234567
you can run bc like:
$ bc -q start_cmds data.semi quit_cmd
1
10
11
100
0
100101101011010000111
which is the same data that the awk script is outputting, but only calling bc a single time with all of the inputs. Now, while that data isn't in an awk associative array in a script, the bc output could be written as stdin input to awk and reassembled into an array like:
bc -q start_cmds data.semi quit_cmd | awk 'FNR==NR {a[FNR]=$1; next} END { for( k in a ) print k, a[k] }' -
1 1
2 10
3 11
4 100
5 0
6 100101101011010000111
where the final dash is telling awk to treat stdin as an input file and lets you add other files later for processing.

Find smallest missing integer in an array

I'm writing a bash script which requires searching for the smallest available integer in an array and piping it into a variable.
I know how to identify the smallest or the largest integer in an array but I can't figure out how to identify the 'missing' smallest integer.
Example array:
1
2
4
5
6
In this example I would need 3 as a variable.
Using sed for this would be silly. With GNU awk you could do
array=(1 2 4 5 6)
echo "${array[#]}" | awk -v RS='\\s+' '{ a[$1] } END { for(i = 1; i in a; ++i); print i }'
...which remembers all numbers, then counts from 1 until it finds one that it doesn't remember and prints that. You can then remember this number in bash with
array=(1 2 4 5 6)
number=$(echo "${array[@]}" | awk -v RS='\\s+' '{ a[$1] } END { for(i = 1; i in a; ++i); print i }')
However, if you're already using bash, you could just do the same thing in pure bash:
#!/bin/bash
array=(1 2 4 5 6)
declare -a seen
for i in ${array[@]}; do
seen[$i]=1
done
for((number = 1; seen[number] == 1; ++number)); do true; done
echo $number
You can iterate from the minimal to the maximal number and take the first non-existing element:
use List::Util qw( first );
my @arr = sort {$a <=> $b} qw(1 2 4 5 6);
my $min = $arr[0];
my $max = $arr[-1];
my %seen;
@seen{@arr} = ();
my $first = first { !exists $seen{$_} } $min .. $max;
This code will do as you ask. It can easily be accelerated by using a binary search, but it is clearest stated in this way.
The first element of the array can be any integer, and the subroutine returns the first value that isn't in the sequence. It returns undef if the complete array is contiguous.
use strict;
use warnings;
use 5.010;
my @data = qw/ 1 2 4 5 6 /;
say first_missing(@data);
@data = ( 4 .. 99, 101 .. 122 );
say first_missing(@data);
sub first_missing {
my $start = $_[0];
for my $i ( 1 .. $#_ ) {
my $expected = $start + $i;
return $expected unless $_[$i] == $expected;
}
return;
}
output
3
100
Here is a Perl one liner:
$ echo '1 2 4 5 6' | perl -lane '}
{@a=sort { $a <=> $b } @F; %h=map {$_=>1} @a;
foreach ($a[0]..$a[-1]) { if (!exists($h{$_})) {print $_}} ;'
If you want to switch from a pipeline to a file input:
$ perl -lane '}
{@a=sort { $a <=> $b } @F; %h=map {$_=>1} @a;
foreach ($a[0]..$a[-1]) { if (!exists($h{$_})) {print $_}} ;' file
Since it is sorted in the process, input can be in arbitrary order.
$ cat tst.awk
BEGIN {
split("1 2 4 5 6",a)
for (i=1;a[i+1]==a[i]+1;i++) ;
print a[i]+1
}
$ awk -f tst.awk
3
Having fun with @Borodin's excellent answer:
#!/usr/bin/env perl
use 5.020; # why not?
use strict;
use warnings;
sub increasing_stream {
my $start = int($_[0]);
return sub {
$start += 1 + (rand(1) > 0.9);
};
}
my $stream = increasing_stream(rand(1000));
my $first = $stream->();
say $first;
while (1) {
my $next = $stream->();
say $next;
last unless $next == ++$first;
$first = $next;
}
say "Skipped: $first";
Output:
$ ./tyu.pl
381
382
383
384
385
386
387
388
389
390
391
392
393
395
Skipped: 394
Here's one bash solution (assuming the numbers are in a file, one per line):
sort -n numbers.txt | grep -n . |
grep -v -m1 '\([0-9]\+\):\1' | cut -f1 -d:
The first part sorts the numbers and then adds a sequence number to each one, and the second part finds the first sequence number which doesn't correspond to the number in the array.
Same thing, using sort and awk (bog-standard, no extensions in either):
sort -n numbers.txt | awk '$1!=NR{print NR;exit}'
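A quick check with the example numbers:
$ printf '%s\n' 1 2 4 5 6 > numbers.txt
$ sort -n numbers.txt | awk '$1!=NR{print NR;exit}'
3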
Here is a slight variation on the theme set by other answers. Values coming in are not necessarily pre-sorted:
$ cat test
sort -nu <<END-OF-LIST |
1
5
2
4
6
END-OF-LIST
awk 'BEGIN { M = 1 } M > $1 { next } M == $1 { M++; next }
M < $1 { exit } END { print M }'
$ sh test
3
Notes:
If numbers are pre-sorted, do not bother with the sort.
If there are no missing numbers, the next higher number is output.
In this example, a here document supplies numbers, but one can use a file or pipe.
M may start greater than the smallest to ignore missing numbers below a threshold.
To auto-start the search at the lowest number, change BEGIN { M = 1 } to NR == 1 { M = $1 }.
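Putting that last note together, a sketch that auto-starts at the lowest value (here fed unsorted numbers through a pipe):
$ printf '%s\n' 5 8 6 9 | sort -n |
  awk 'NR == 1 { M = $1 } M > $1 { next } M == $1 { M++; next }
       M < $1 { exit } END { print M }'
7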

Uniq in awk; removing duplicate values in a column using awk

I have a large datafile in the following format below:
ENST00000371026 WDR78,WDR78,WDR78, WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32 WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458, atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,
The columns are tab separated. Multiple values within columns are comma separated. I would like to remove the duplicate values in the second column to result in something like this:
ENST00000371026 WDR78 WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32 WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458 atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,
I tried the following code below but it doesn't seem to remove the duplicate values.
awk '
BEGIN { FS="\t" } ;
{
split($2, valueArray,",");
j=0;
for (i in valueArray)
{
if (!( valueArray[i] in duplicateArray))
{
duplicateArray[j] = valueArray[i];
j++;
}
};
printf $1 "\t";
for (j in duplicateArray)
{
if (duplicateArray[j]) {
printf duplicateArray[j] ",";
}
}
printf "\t";
print $3
}' knownGeneFromUCSC.txt
How can I remove the duplicates in column 2 correctly?
Your script acts only on the second record (line) in the file because of NR==2. I took it out, but it may be what you intend. If so, you should put it back.
The in operator checks for the presence of the index, not the value, so I made duplicateArray an associative array* that uses the values from valueArray as its indices. This saves having to iterate over both arrays in a loop within a loop.
The split statement sees "WDR78,WDR78,WDR78," as four fields rather than three, so I added an if to keep it from printing a null value, which would otherwise result in ",WDR78," being printed.
* In reality all arrays in AWK are associative.
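To illustrate the in-operator point (a minimal sketch):
$ awk 'BEGIN {
    split("WDR78,WDR78", v, ",")   # v[1]="WDR78", v[2]="WDR78"
    print ("WDR78" in v)           # 0: "WDR78" is a value, not an index
    print (1 in v)                 # 1: the indices are 1 and 2
}'
0
1
With those fixes, the corrected script: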
awk '
BEGIN { FS="\t" } ;
{
split($2, valueArray,",");
j=0;
for (i in valueArray)
{
if (!(valueArray[i] in duplicateArray))
{
duplicateArray[valueArray[i]] = 1
}
};
printf $1 "\t";
for (j in duplicateArray)
{
if (j) # prevents printing an extra comma
{
printf j ",";
}
}
printf "\t";
print $3
delete duplicateArray # for non-gawk, use split("", duplicateArray)
}'
Perl:
perl -F'\t' -lane'
$F[1] = join ",", grep !$_{$_}++, split ",", $F[1];
print join "\t", @F; %_ = ();
' infile
awk:
awk -F'\t' '{
n = split($2, t, ","); _2 = x
split(x, _) # use delete _ if supported
for (i = 0; ++i <= n;)
_[t[i]]++ || _2 = _2 ? _2 "," t[i] : t[i]
$2 = _2
}-3' OFS='\t' infile
Line 4 in the awk script is used to preserve the original order of the values in the second field after filtering out the duplicates.
Sorry, I know you asked about awk... but Perl makes this much simpler:
$ perl -n -e ' @t = split(/\t/);
%t2 = map { $_ => 1 } split(/,/,$t[1]);
$t[1] = join(",",keys %t2);
print join("\t",@t); ' knownGeneFromUCSC.txt
Pure Bash 4.0 (one associative array):
declare -a part # parts of a line
declare -a part2 # parts 2. column
declare -A check # used to remember items in part2
while read line ; do
part=( $line ) # split line using whitespaces
IFS=',' # separator is comma
part2=( ${part[1]} ) # split 2. column using comma
if [ ${#part2[@]} -gt 1 ] ; then # more than 1 field in 2. column?
check=() # empty check array
new2='' # empty new 2. column
for item in ${part2[@]} ; do
(( check[$item]++ )) # remember items in 2. column
if [ ${check[$item]} -eq 1 ] ; then # not yet seen?
new2=$new2,$item # add to new 2. column
fi
done
part[1]=${new2#,} # remove leading comma
fi
IFS=$'\t' # separator for the output
echo "${part[*]}" # rebuild line
done < "$infile"
