awk to remove duplicate rows totally based on a particular column value - sorting

I got a dataset like:
6 AA_A_56_30018678_E 0 30018678 P A
6 SNP_A_30018678 0 30018678 A G
6 SNP_A_30018679 0 30018679 T G
6 SNP_A_30018682 0 30018682 T G
6 SNP_A_30018695 0 30018695 G C
6 AA_A_62_30018696_Q 0 30018696 P A
6 AA_A_62_30018696_G 0 30018696 P A
6 AA_A_62_30018696_R 0 30018696 P A
I want to remove all rows whose column 4 value occurs more than once.
I have used the commands below (sort, awk, uniq and join...) to get the required output; however, is there a better way to do this?
sort -k4,4 example.txt | awk '{print $4}' | uniq -u > snp_sort.txt
join -1 1 -2 4 snp_sort.txt example.txt | awk '{print $3,$5,$6,$1}' > uniq.txt
Here is the output
SNP_A_30018679 T G 30018679
SNP_A_30018682 T G 30018682
SNP_A_30018695 G C 30018695

Using awk to filter out duplicate keys and print only those lines whose fourth column occurs exactly once.
awk '{k=($2 FS $5 FS $6 FS $4)} {a[$4]++;b[$4]=k}END{for(x in a)if(a[x]==1)print b[x]}' input_file
SNP_A_30018682 T G 30018682
SNP_A_30018695 G C 30018695
SNP_A_30018679 T G 30018679
The idea is to:
Count the occurrences of each $4 value in array a and store the reordered line for that value in array b.
Print the stored line for those $4 values which occur exactly once.
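For readability, the same program can be laid out with comments (a sketch equivalent to the one-liner above, assuming the default whitespace field splitting):
awk '
{
    k = $2 FS $5 FS $6 FS $4   # reordered line: name, alleles, position
    a[$4]++                    # count how often this position ($4) occurs
    b[$4] = k                  # remember the reordered line for this position
}
END {
    for (x in a)
        if (a[x] == 1)         # keep only positions seen exactly once
            print b[x]
}' input_file
As with the one-liner, the for (x in a) loop visits the keys in an unspecified order, which is why the output above is not sorted.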

Using command substitution: first print the values that occur only once in the fourth field, then grep for those values.
grep "$(echo "$(awk '{print $4}' inputfile.txt)" |sort |uniq -u)" inputfile.txt
6 SNP_A_30018679 0 30018679 T G
6 SNP_A_30018682 0 30018682 T G
6 SNP_A_30018695 0 30018695 G C
Note: append awk '{NF=4}1' to the end of the command if you wish to print only the first four columns. Of course, you can change which columns are used by changing $4 and NF=4.
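For example, appended to the full pipeline (this assumes GNU grep, which treats a newline-separated pattern list as alternatives, and an awk such as gawk or mawk that rebuilds the record when NF is assigned):
grep "$(echo "$(awk '{print $4}' inputfile.txt)" |sort |uniq -u)" inputfile.txt | awk '{NF=4}1'
6 SNP_A_30018679 0 30018679
6 SNP_A_30018682 0 30018682
6 SNP_A_30018695 0 30018695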

$ awk 'NR==FNR{c[$4]++;next} c[$4]<2' file file
6 SNP_A_30018679 0 30018679 T G
6 SNP_A_30018682 0 30018682 T G
6 SNP_A_30018695 0 30018695 G C
The file is read twice: on the first pass (NR==FNR) the number of times each $4 value occurs is counted in c; on the second pass a line is printed only if its count is below 2.

Since your 'key' is fixed-width, uniq (GNU coreutils) can compare on just that part of the line: skip the first three fields with -f and limit the comparison with -w. Note that the blank before the fourth field counts towards -w, so with the single-space separators shown, nine characters cover the blank plus the 8-digit key.
sort -k4,4 example.txt | uniq -u -f 3 -w 9 > uniq.txt

Another in awk:
$ awk '{$1=$1; a[$4]=a[$4] $0} END{for(i in a) if(gsub(FS,FS,a[i])==5) print a[i]}' file
6 SNP_A_30018679 0 30018679 T G
6 SNP_A_30018682 0 30018682 T G
6 SNP_A_30018695 0 30018695 G C
Concatenate lines to an array entry using $4 as the key. If an entry ends up with more than 5 field separators, duplicates were concatenated into it and it is not printed.
And yet another version in awk. It expects the file to be sorted on the fourth field, stores only the keys in memory rather than all lines (since the key field is sorted, even that could probably be avoided), and runs in one pass:
$ cat ananother.awk
++seen[p[4]]==1 && NR>1 && p[4]!=$4 {    # seen count must be 1 and
    print prev                           # this and previous $4 must differ
    delete seen                          # is this enough really?
}
{
    q=p[4]                               # previous previous $4 for END
    prev=$0                              # previous is stored for printing
    split($0,p)                          # to get previous $4
}
END {                                    # last record control
    if(++seen[$4]==1 && q!=$4)
        print $0
}
Run:
$ sort -k4,4 file | awk -f ananother.awk
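Run on the sample data, it should print the three lines whose fourth column occurs exactly once, e.g.:
6 SNP_A_30018679 0 30018679 T G
6 SNP_A_30018682 0 30018682 T G
6 SNP_A_30018695 0 30018695 G C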

A simpler way, if it is enough to keep one representative row per key instead of dropping every duplicated key (note that the file is space-delimited and that cut emits fields in their original order):
cut -d' ' -f2,4,5,6 example.txt | sort -u -k2,2 > uniq.txt

Related

Piping commands of awk and sed is too slow! Any ideas on how to make it work faster?

I am trying to convert a file containing a column with scaffold numbers and another one with corresponding individual sites into a bed file which lists sites in ranges. For example, this file ($indiv.txt):
SCAFF SITE
1 1
1 2
1 3
1 4
1 5
3 1
3 2
3 34
3 35
3 36
should be converted into $indiv.bed:
SCAFF SITE-START SITE-END
1 1 5
3 1 2
3 34 36
Currently, I am using the following code but it is super slow so I wanted to ask if anybody could come up with a quicker way??
COMMAND:
for scaff in $(awk '{print $1}' $indiv.txt | uniq)
do
    awk -v I=$scaff '$1 == I { print $2 }' $indiv.txt | awk 'NR==1{first=$1;last=$1;next} $1 == last+1 {last=$1;next} {print first,last;first=$1;last=first} END{print first,last}' | sed "s/^/$scaff\t/" >> $indiv.bed
done
DESCRIPTION:
awk '{print $1}' $indiv.txt | uniq #outputs a list with the unique scaffold numbers
awk -v I=$scaff '$1 == I { print $2 }' $indiv.txt #extracts the values from column 2 if the value in the first column equals the variable $scaff
awk 'NR==1{first=$1;last=$1;next} $1 == last+1 {last=$1;next} {print first,last;first=$1;last=first} END{print first,last}' #converts the list of sequential numbers into ranges as described here: https://stackoverflow.com/questions/26809668/collapse-sequential-numbers-to-ranges-in-bash
sed "s/^/$scaff\t/" >> $indiv.bed #adds a column with the respective scaffold number and then outputs the file into $indiv.bed
Thanks a lot in advance!
Calling several programs for each group of the input is bound to be slow. It's usually better to find a way to process all the lines in a single pass.
I'd reach for Perl:
tail -n+2 indiv.txt \
| sort -u -nk1,1 -nk2,2 \
| perl -ane 'END {print " $F[1]"}
             next if $p[0] == $F[0] && $F[1] == $p[1] + 1;
             print " $p[1]\n@F";
         } continue { @p = @F;' > indiv.bed
The first two lines sort the input so that the groups are always adjacent (this might be unnecessary if your input is already sorted that way). Perl then reads the lines: -a splits each line into the @F array, and the @p array keeps the previous line. If the current line has the same first element and its second element is greater by 1, we go straight to the continue section, which just stores the current line into @p. Otherwise, we print the last element of the previous section and the first line of the current one. The END block is responsible for printing the last element of the last section.
The output is different from yours for sections that have only a single member.
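For comparison, a rough single-pass awk equivalent (only a sketch; it reuses the header-stripping and sorting from above and the file names from the question):
tail -n+2 indiv.txt | sort -nk1,1 -nk2,2 | awk '
    $1 != scaff || $2 != last + 1 {          # new scaffold or non-consecutive site: close the previous range
        if (NR > 1) print scaff, first, last
        scaff = $1; first = $2
    }
    { last = $2 }                            # current site becomes the "previous" one for the next record
    END { if (NR) print scaff, first, last } # close the final range
' > indiv.bed
Unlike the Perl version, it prints an explicit start and end for single-member sections as well (both values are then equal).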

Replace atomic numbers in a column with the corresponding atomic symbols in a file

Can anybody tell me how I can replace the atomic numbers in the first column of a file with the corresponding atomic symbols in a bash script? I have many files that need to be converted this way.
file:HCOOH
6 0 -.134702 .401251 -.000249
8 0 -1.134262 -.264582 .000069
8 0 1.118680 -.091075 .000056
1 0 -.107617 1.495465 .000513
1 0 1.040484 -1.057714 -.000020
Desired Result:
C 0 -.134702 .401251 -.000249
O 0 -1.134262 -.264582 .000069
O 0 1.118680 -.091075 .000056
H 0 -.107617 1.495465 .000513
H 0 1.040484 -1.057714 -.000020
My aim is to extract the geometry of a system from the benchmark databases given in the supporting information of the paper "J. Chem. Theory Comput., 2005, 1 (3), pp 415–432, DOI: 10.1021/ct049851d". As the atoms are given as atomic numbers in the databases, I cannot use the geometry directly in the NWCHEM code, so I need to replace them with their corresponding symbols. Using the script
#!/bin/bash
atoms=(HCOOH H He Li Be B C N O F Ne)
name="$(awk '{print $1}' HCOOH)"
rm atom
for j in ${name};
do
    echo ${atoms[$j]} >>atom
done
awk 'FNR==NR{a[NR]=$1;next}{$1=a[FNR]}1' atom HCOOH | awk '{printf "%-3s %-1s %10.5f %10.5f %10.5f\n", $1, $2, $3, $4, $5}'
I am getting
HCOOH 0.00000 0.00000 0.00000
C 0 -0.13470 0.40125 -0.00025
O 0 -1.13426 -0.26458 0.00007
O 0 1.11868 -0.09108 0.00006
H 0 -0.10762 1.49547 0.00051
H 0 1.04048 -1.05771 -0.00002
I could not avoid the zeros appearing in the first line when the formatted output is applied. I would be happy if someone could help print the formatted output without the zeros in the first line.
Thanks.
Finally, I got the desired result using the script
#!/bin/bash
atoms=(HCOOH H He Li Be B C N O F Ne)
name="$(awk '{print $1}' HCOOH)"
rm atom
for j in ${name};
do
    echo ${atoms[$j]} >>atom
done
awk 'FNR==NR{a[NR]=$1;next}{$1=a[FNR]}1' atom HCOOH | awk 'NR==1{printf "%-3s\n", $1}' >tHCOOH
awk 'FNR==NR{a[NR]=$1;next}{$1=a[FNR]}1' atom HCOOH | awk 'NR> 1{printf "%-3s %-1s %10.5f %10.5f %10.5f\n", $1, $2, $3, $4, $5}' >>tHCOOH
mv tHCOOH HCOOH
which gives
HCOOH
C 0 -0.13470 0.40125 -0.00025
O 0 -1.13426 -0.26458 0.00007
O 0 1.11868 -0.09108 0.00006
H 0 -0.10762 1.49547 0.00051
H 0 1.04048 -1.05771 -0.00002
Let me know if there is a better way for getting the same output.
Thanks.
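One possible simplification, sketched under the same assumptions as the script above (the first line of HCOOH is the title line and the atomic numbers do not go beyond Ne), does the lookup inside awk and avoids the temporary atom file:
awk 'BEGIN { split("H He Li Be B C N O F Ne", sym) }       # sym[1]="H", ..., sym[10]="Ne"
     NR == 1 { print $1; next }                            # pass the title line through unformatted
     { printf "%-3s %-1s %10.5f %10.5f %10.5f\n", sym[$1], $2, $3, $4, $5 }' HCOOH > tHCOOH &&
mv tHCOOH HCOOH
The BEGIN block builds the number-to-symbol table once, the title line is passed through untouched, and every other line is formatted directly.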

Bash: replacing a column by another and using AWK to print specific order

I have a dummy file that looks like so:
a ID_1 S1 S2
b SNP1 1 0
c SNP2 2 1
d SNP3 1 0
I want to replace the contents of column 2 by the corresponding line number. My file would then look like so:
a 1 S1 S2
b 2 1 0
c 3 2 1
d 4 1 0
I can do this with the following command:
cut -f 1,3-4 -d " " file.txt | awk '{print $1 " " FNR " " $2,$3}'
My question is, is there a better way of doing this? In particular, the real file I am working on has 2303 columns. Obviously I don't want to have to write:
cut -f 1,3-2303 -d " " file.txt | awk '{print $1 " " FNR " " $2,$3,$4,$5 ETC}'
Is there a way to tell awk to print from column 2 to the last column without having to write all the names?
Thanks
I think this should do
$ awk '{$2=FNR} 1' file.txt
a 1 S1 S2
b 2 1 0
c 3 2 1
d 4 1 0
Change the second column and print the changed record. The default OFS is a single space, which is what you need here.
The above command is the idiomatic way to write
awk '{$2=FNR} {print $0}' file.txt
You can think of a simple awk program as awk 'cond1{action1} cond2{action2} ...': action1 is executed only if cond1 evaluates to true, and so on. If the action portion is omitted, awk prints the input record by default, and 1 is simply one way of writing an always-true condition.
See Idiomatic awk mentioned in https://stackoverflow.com/tags/awk/info for more such idioms.
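As a quick illustration of the pattern/action idiom on a hypothetical two-line input:
$ printf 'x\ny\n' | awk 'NR==1{print "first:", $0} 1'
first: x
x
y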
The following awk may also help you here.
awk '{sub(/.*/,FNR,$2)} 1' Input_file
Output will be as follows.
a 1 S1 S2
b 2 1 0
c 3 2 1
d 4 1 0
Explanation: use awk's sub function to replace everything in $2 (the second field) with FNR, awk's built-in variable holding the current line number of the Input_file; the trailing 1 then prints the current line.

Grouping elements by two fields on a space delimited file

I have this data in a space-delimited file, ordered by column 2, then 3, then 1 (I used Linux sort to do that):
0 0 2
1 0 2
2 0 2
1 1 4
2 1 4
I want to create a new file (leaving the old file as is)
0 2 0,1,2
1 4 1,2
Basically, put fields 2 and 3 first and group the elements of field 1 (as a comma-separated list) by them. Is there a way to do that with an awk, sed, or bash one-liner, so as to avoid writing a Java or C++ app for it?
Since the file is already ordered, you can print each group as soon as the key changes:
awk '
seen==$2 FS $3 { line=line "," $1; next }
{ if(seen) print seen, line; seen=$2 FS $3; line=$1 }
END { print seen, line }
' file
0 2 0,1,2
1 4 1,2
This will preserve the order of output.
With your input and output, this line may help:
awk '{f=$2 FS $3}!(f in a){i[++p]=f;a[f]=$1;next}
{a[f]=a[f]","$1}END{for(x=1;x<=p;x++)print i[x],a[i[x]]}' file
test:
kent$ cat f
0 0 2
1 0 2
2 0 2
1 1 4
2 1 4
kent$ awk '{f=$2 FS $3}!(f in a){i[++p]=f;a[f]=$1;next}{a[f]=a[f]","$1}END{for(x=1;x<=p;x++)print i[x],a[i[x]]}' f
0 2 0,1,2
1 4 1,2
awk 'a[$2, $3]++ { p = p "," $1; next } p { print p } { p = $2 FS $3 FS $1 } END { if (p) print p }' file
Output:
0 2 0,1,2
1 4 1,2
The solution assumes the data in the second and third columns is sorted.
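If the input is not already grouped, it can be sorted first and piped into the same program (a sketch; numeric sort keys assumed, as in the sample data):
sort -k2,2n -k3,3n -k1,1n file |
awk 'a[$2, $3]++ { p = p "," $1; next } p { print p } { p = $2 FS $3 FS $1 } END { if (p) print p }'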
Using awk:
awk '{k=$2 OFS $3} !(k in a){a[k]=$1; b[++n]=k; next} {a[k]=a[k] "," $1}
END{for (i=1; i<=n; i++) print b[i],a[b[i]]}' file
0 2 0,1,2
1 4 1,2
Yet another take:
awk -v SUBSEP=" " '
{group[$2,$3] = group[$2,$3] $1 ","}
END {
for (g in group) {
sub(/,$/,"",group[g])
print g, group[g]
}
}
' file > newfile
The SUBSEP variable is the character awk uses to join the comma-separated subscripts (as in group[$2,$3]) into a single key of its one-dimensional arrays.
http://www.gnu.org/software/gawk/manual/html_node/Multidimensional.html#Multidimensional
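A minimal illustration (no input file needed; toy subscripts) of how SUBSEP joins the parts of a comma-separated subscript:
$ awk 'BEGIN { a["x","y"] = 1; for (k in a) { split(k, parts, SUBSEP); print parts[1], parts[2] } }'
x y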
This might work for you (GNU sed):
sed -r ':a;$!N;/(. (. .).*)\n(.) \2.*/s//\1,\3/;ta;s/(.) (.) (.)/\2 \3 \1/;P;D' file
This appends the first column of the subsequent record to the first record until the second and third keys change. Then the fields in the first record are re-arranged and printed out.
This uses the data presented but can be adapted for more complex data.
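If GNU datamash happens to be available, the same grouping can be sketched without awk; this assumes the input is already sorted by columns 2 and 3, as in the question (add -s to let datamash sort it first):
datamash -t' ' -g 2,3 collapse 1 < file
0 2 0,1,2
1 4 1,2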

Counting unique strings where there's a single string per line in bash

Given input file
z
b
a
f
g
a
b
...
I want to output the number of occurrences of each string, for example:
z 1
b 2
a 2
f 1
g 1
How can this be done in a bash script?
You can sort the input and pass it to uniq -c:
$ sort input_file | uniq -c
2 a
2 b
1 f
1 g
1 z
If you want the numbers on the right, use awk to switch them:
$ sort input_file | uniq -c | awk '{print $2, $1}'
a 2
b 2
f 1
g 1
z 1
Alternatively, do the whole thing in awk:
$ awk '
{
++count[$1]
}
END {
for (word in count) {
print word, count[word]
}
}
' input_file
f 1
g 1
z 1
a 2
b 2
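If a stable ordering is wanted (the for..in loop above visits keys in an unspecified order), the output can simply be piped through sort, for example by descending count:
$ awk '{ ++count[$1] } END { for (word in count) print word, count[word] }' input_file | sort -k2,2nr -k1,1
a 2
b 2
f 1
g 1
z 1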
cat text | sort | uniq -c
should do the job
Try:
awk '{ freq[$1]++; } END{ for( c in freq ) { print c, freq[c] } }' test.txt
Where test.txt would be your input file.
Here's a bash-only version (requires bash version 4), using an associative array.
#! /bin/bash
declare -A count
while read val ; do
    count[$val]=$(( ${count[$val]} + 1 ))
done < your_input_file # change this as needed
for key in "${!count[@]}" ; do
    echo $key ${count[$key]}
done
This might work for you:
cat -n file |
sort -k2,2 |
uniq -cf1 |
sort -k2,2n |
sed 's/^ *\([^ ]*\).*\t\(.*\)/\2 \1/'
This outputs the number of occurrences of each string in the order in which they first appear.
You can use sort filename | uniq -c.
Have a look at the Wikipedia page on uniq.
