Grouping elements by two fields on a space-delimited file - bash

I have this data in a space-delimited file, sorted by column 2, then column 3, then column 1 (I used Linux sort to do that):
0 0 2
1 0 2
2 0 2
1 1 4
2 1 4
I want to create a new file (leaving the old file as is) that looks like this:
0 2 0,1,2
1 4 1,2
Basically, put fields 2 and 3 first and group the elements of field 1 (as a comma-separated list) by them. Is there a way to do that with an awk, sed, or bash one-liner, so as to avoid writing a Java or C++ app for it?

Since the file is already sorted, you can print each group as the key changes:
awk '
seen==$2 FS $3 { line=line "," $1; next }
{ if(seen) print seen, line; seen=$2 FS $3; line=$1 }
END { print seen, line }
' file
0 2 0,1,2
1 4 1,2
This preserves the input order in the output.
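As an aside, $2 FS $3 simply concatenates the two fields with the field separator in between, building a single comparison key. A minimal sketch of that idiom on its own:
$ echo 'a b c' | awk '{ print ($2 FS $3) }'
b c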

With your input and output, this line may help:
awk '{f=$2 FS $3}!(f in a){i[++p]=f;a[f]=$1;next}
{a[f]=a[f]","$1}END{for(x=1;x<=p;x++)print i[x],a[i[x]]}' file
test:
kent$ cat f
0 0 2
1 0 2
2 0 2
1 1 4
2 1 4
kent$ awk '{f=$2 FS $3}!(f in a){i[++p]=f;a[f]=$1;next}{a[f]=a[f]","$1}END{for(x=1;x<=p;x++)print i[x],a[i[x]]}' f
0 2 0,1,2
1 4 1,2

awk 'a[$2, $3]++ { p = p "," $1; next } p { print p } { p = $2 FS $3 FS $1 } END { if (p) print p }' file
Output:
0 2 0,1,2
1 4 1,2
This solution assumes the data in the second and third columns is sorted.

Using awk:
awk '{k=$2 OFS $3} !(k in a){a[k]=$1; b[++n]=k; next} {a[k]=a[k] "," $1}
END{for (i=1; i<=n; i++) print b[i],a[b[i]]}' file
0 2 0,1,2
1 4 1,2
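The !(k in a) test above checks key membership without creating the key, which is how the first occurrence of each pair is detected. A minimal sketch of the in operator:
$ awk 'BEGIN { a["x"]=1; print ("x" in a), ("y" in a) }'
1 0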

Yet another take:
awk -v SUBSEP=" " '
{group[$2,$3] = group[$2,$3] $1 ","}
END {
for (g in group) {
sub(/,$/,"",group[g])
print g, group[g]
}
}
' file > newfile
The SUBSEP variable is the string awk uses to join the subscripts of a comma-subscripted (simulated multidimensional) array into a single key. Note that for (g in group) visits keys in an unspecified order, so unlike the earlier answers this one does not guarantee the input order of the groups.
http://www.gnu.org/software/gawk/manual/html_node/Multidimensional.html#Multidimensional
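A minimal illustration of how the subscripts are joined: with the default SUBSEP (the control character \034), a["x","y"] is stored under the key "x" SUBSEP "y":
$ awk 'BEGIN { a["x","y"] = 1; for (k in a) print (k == "x" SUBSEP "y") }'
1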

This might work for you (GNU sed):
sed -r ':a;$!N;/(. (. .).*)\n(.) \2.*/s//\1,\3/;ta;s/(.) (.) (.)/\2 \3 \1/;P;D' file
This appends the first column of each subsequent record to the first record until the second and third keys change. Then the fields in the first record are rearranged and printed.
This uses the data presented but can be adapted for more complex data.

Related

Awk if else with conditions

I am trying to make a script (and a loop) to extract matching lines and print them into a new file. There are two conditions. First, I need to print the values of the 2nd and 4th columns of the map file when the 2nd column of the map file matches the 4th column of the test file. Second, when there is no match, I want to print the value in the 2nd column of the test file and a zero in the second output column.
My test file is made this way:
8 8:190568 0 190568
8 8:194947 0 194947
8 8:197042 0 197042
8 8:212894 0 212894
My map file is made this way:
8 190568 0.431475 0.009489
8 194947 0.434984 0.009707
8 19056880 0.395066 112.871160
8 101908687 0.643861 112.872348
1st attempt:
for chr in {21..22};
do
awk 'NR==FNR{a[$2]; next} {if ($4 in a) print $2, $4 in a; else print $2, $4 == "0"}' map_chr$chr.txt test_chr$chr.bim > position.$chr;
done
Result:
8:190568 1
8:194947 1
8:197042 0
8:212894 0
My second script is:
for chr in {21..22}; do
awk 'NR == FNR { ++a[$4]; next }
$4 in a { print a[$2], $4; ++found[$2] }
END { for(k in a) if (!found[k]) print a[k], 0 }' \
"test_chr$chr.bim" "map_chr$chr.txt" >> "position.$chr"
done
And the result is:
1 0
1 0
1 0
1 0
The result I need is:
8:190568 0.009489
8:194947 0.009707
8:197042 0
8:212894 0
This awk should work for you:
awk 'FNR==NR {map[$2]=$4; next} {print $2, map[$4]+0}' mapfile testfile
8:190568 0.009489
8:194947 0.009707
8:197042 0
8:212894 0
This awk command processes mapfile first and stores $2 as the key with $4 as the value in an associative array named map.
Later, when it processes testfile in the 2nd block, we print $2 from the test file together with the value stored in map, using $4 of the test file as the key. Adding 0 to the stored value makes sure we get 0 when $4 is not present in map.
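A minimal illustration of that +0 trick on its own: an unset array element is the empty string, and numeric context turns it into 0:
$ awk 'BEGIN { print m["absent"]+0 }'
0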

How to sum rows in a tsv file using awk?

My input:
Position A B C D No
1 0 0 0 0 0
2 1 0 1 0 0
3 0 6 0 0 0
4 0 0 0 0 0
5 0 5 0 0 0
I have a TSV file, like the above, where I wish to sum each row across the A, B, C, and D columns only, not the Position column.
The desired output is a TSV of two columns, with Position and Sum in the first row:
Position Sum
1 0
2 2
3 6
4 0
5 5
So far I have:
awk 'BEGIN{print"Position\tSum"}{if(NR==1)next; sum=$2+$3+$4+$5 printf"%d\t%d\n",$sum}' infile.tsv > outfile.tsv
You were very close, try this:
awk 'BEGIN{print"Position\tSum"}{if(NR==1)next; sum=$2+$3+$4+$5; printf "%d\t%d\n",$1,sum; }' infile.tsv > outfile.tsv
But I say it's way cleaner with newlines and spaces:
awk '
BEGIN {
    print "Position\tSum";
}
{
    if (NR==1) {
        next;
    }
    sum = $2 + $3 + $4 + $5;
    printf "%d\t%d\n", $1, sum;
}'
A minimalist script can be:
$ awk '{print $1 "\t" (NR==1?"Sum":$2+$3+$4+$5)}' file
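Assuming file holds the sample input above, this prints (tabs shown as spaces):
Position Sum
1 0
2 2
3 6
4 0
5 5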
Could you please try the following. You were hard-coding field numbers, which will not work in many cases, so here is a loop approach: it skips the first field and sums the remaining data fields (fields 2 through NF-1, which leaves out the final No column).
awk 'FNR==1{print $1,"sum";next} {for(i=2;i<NF;i++){sum+=$i};print $1,sum;sum=""}' Input_file
To get tab-separated output, add BEGIN{OFS="\t"} at the start of the program and keep the rest of the code the same, as in the sketch below.
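A sketch of that variant, assuming the same Input_file:
awk 'BEGIN{OFS="\t"} FNR==1{print $1,"sum";next} {for(i=2;i<NF;i++){sum+=$i};print $1,sum;sum=""}' Input_file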

Bash: replacing a column by another and using AWK to print specific order

I have a dummy file that looks like so:
a ID_1 S1 S2
b SNP1 1 0
c SNP2 2 1
d SNP3 1 0
I want to replace the contents of column 2 by the corresponding line number. My file would then look like so:
a 1 S1 S2
b 2 1 0
c 3 2 1
d 4 1 0
I can do this with the following command:
cut -f 1,3-4 -d " " file.txt | awk '{print $1 " " FNR " " $2,$3}'
My question is, is there a better way of doing this? In particular, the real file I am working on has 2303 columns. Obviously I don't want to have to write:
cut -f 1,3-2303 -d " " file.txt | awk '{print $1 " " FNR " " $2,$3,$4,$5 ETC}'
Is there a way to tell awk to print from column 2 to the last column without having to write all the names?
Thanks
I think this should do:
$ awk '{$2=FNR} 1' file.txt
a 1 S1 S2
b 2 1 0
c 3 2 1
d 4 1 0
Change the second column and print the changed record. The default OFS is a single space, which is what you need here.
The above command is the idiomatic way to write
awk '{$2=FNR} {print $0}' file.txt
You can think of a simple awk program as awk 'cond1{action1} cond2{action2} ...'.
Only if cond1 evaluates to true is action1 executed, and so on. If the action portion is omitted, awk by default prints the input record. 1 is simply one way to write an always-true condition.
See Idiomatic awk in https://stackoverflow.com/tags/awk/info for more such idioms.
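A quick demonstration of 1 acting as an always-true condition with the default print action:
$ printf 'a\nb\n' | awk '1'
a
b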
The following awk may also help here:
awk '{sub(/.*/,FNR,$2)} 1' Input_file
Output will be as follows.
a 1 S1 S2
b 2 1 0
c 3 2 1
d 4 1 0
Explanation: this uses awk's sub function to substitute everything in $2 (the second field) with FNR, awk's built-in variable holding the current line number of the input file; the trailing 1 then prints the current line.
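A minimal sketch of that sub call in isolation:
$ echo 'x OLD y' | awk '{sub(/.*/,"NEW",$2)} 1'
x NEW y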

Bash: extract columns with cut and filter one column further

I have a tab-separated file and want to extract a few columns with cut.
Two example lines:
(...)
0 0 1 0 AB=1,2,3;CD=4,5,6;EF=7,8,9 0 0
1 1 0 0 AB=2,1,3;CD=1,1,2;EF=5,3,4 0 1
(...)
What I want to achieve is to select columns 2, 3, 5 and 7; however, from column 5 I only want the CD=4,5,6 part.
So my expected result is
0 1 CD=4,5,6; 0
1 0 CD=1,1,2; 1
How can I use cut for this problem and run grep on one of the extracted columns? Any other one-liner is of course also fine.
Here is another awk:
$ awk -F'\t|;' -v OFS='\t' '{print $2,$3,$6,$NF}' file
0 1 CD=4,5,6 0
1 0 CD=1,1,2 1
or with cut/paste
$ paste <(cut -f2,3 file) <(cut -d';' -f2 file) <(cut -f7 file)
0 1 CD=4,5,6 0
1 0 CD=1,1,2 1
Easier done with awk. Split the 5th field using ; as the separator, and then print the second subfield.
awk 'BEGIN {FS="\t"; OFS="\t"}
{split($5, a, ";"); print $2, $3, a[2]";", $7 }' inputfile > outputfile
If you want to print whichever subfield begins with CD=, use a loop:
awk 'BEGIN {FS="\t"; OFS="\t"}
{n = split($5, a, ";");
for (i = 1; i <= n; i++) {
if (a[i] ~ /^CD=/) subfield = a[i];
}
print $2, $3, subfield";", $7}' < inputfile > outputfile
I think awk is the best tool for this kind of task and the other two answers give you good short solutions.
I want to point out that you can use awk's built-in splitting facility to gain more flexibility when parsing input. Here is an example script that uses implicit splitting:
parse.awk
# Remember second, third and seventh columns
{
a = $2
b = $3
d = $7
}
# Split the fifth column on ";". After this the positional variables
# (e.g. $1, $2, ..., $NF) contain the fields from the previous
# fifth column
{
oldFS = FS
FS = ";"
$0 = $5
}
# For example, to test if the second element starts with "CD", do
# something like this
$2 ~ /^CD/ {
c = $2
}
# Print the selected elements
{
print a, b, c, d
}
# Restore FS
{
FS = oldFS
}
Run it like this:
awk -f parse.awk FS='\t' OFS='\t' infile
Output:
0 1 CD=4,5,6 0
1 0 CD=1,1,2 1
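The trick that makes parse.awk work is that assigning to $0 re-splits the record using the current value of FS. A minimal sketch of that behavior:
$ echo 'a:b c' | awk '{FS=":"; $0=$1; print $1}'
a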

Add new field at the end of each line based on value of existing field (sed or awk)

I have a set of CSV files to which I wish to add a field at the end of each line.
The first field is an ID, some ten-digit number:
id,2nd_field,...,last_field
1234567890,Smith,...,Arkansas
1234567891,Jones,...,California
1234567892,White,...,
I want to add another field at the end where the value is based on modulo 3 (id % 3) of the ID:
id,2nd_field,...,last_field,added_field
1234567890,Smith,...,Arkansas,x
1234567891,Jones,...,California,y
1234567892,White,...,,z
Please take into account the fact that the last_field could be null or blank.
How to do this using sed or awk? I'm a newbie on using these tools, kindly provide as well some explanation to your script. Thanks.
Using awk:
awk 'BEGIN{FS=OFS=","} NR==1{print $0, "added_field"; next}
($1%3)==0{p="x"} ($1%3)==1{p="y"} ($1%3)==2{p="z"} {print $0, p}' file
Output:
id,2nd_field,...,last_field,added_field
1234567890,Smith,...,Arkansas,x
1234567891,Jones,...,California,y
1234567892,White,...,,z
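In short: BEGIN{FS=OFS=","} makes both the input and output separators a comma, NR==1 prints the header row with the new column name, and the three pattern-action pairs set p according to the ID's remainder modulo 3 before the final block appends it. The remainder logic in isolation:
$ awk 'BEGIN { print 1234567890%3, 1234567891%3, 1234567892%3 }'
0 1 2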
$ cat tst.awk
BEGIN { FS=OFS=","; split("y,z,x",map) }
{ print $0, (NR>1 ? map[($1-1)%3+1] : "added_field") }
$ awk -f tst.awk file
id,2nd_field,...,last_field,added_field
1234567890,Smith,...,Arkansas,x
1234567891,Jones,...,California,y
1234567892,White,...,,z
The above just uses split() to create a mapping of:
map[1] = y
map[2] = z
map[3] = x
and then accesses it when needed via the common (VALUE-1)%N+1 idiom, which maps values 1,2,..,N-1,N to 1,2,..,N-1,N instead of the 1,2,..,N-1,0 that a plain VALUE%N would give:
map[($1-1)%3+1]
e.g.:
$ awk 'BEGIN{ for (i=1;i<=6;i++) print i, i%3, (i-1)%3+1 }'
1 1 1
2 2 2
3 0 3
4 1 1
5 2 2
6 0 3
