I'm trying to split multiple comma-separated values into rows.
I have achieved it in a small number of columns with comma-separated values (with awk), but in my real table, I have to do it in 80 columns. So, I'm looking for a way to iterate.
An example of input that I need to split:
CHROM POS REF ALT GT_00 C_00 D_OO E_00 F_00 GT_11
chr1 10 T A 2,2 1,1 0,1 1,2 1,0 2
chr1 10 T G 3 2 1 2 0 0
The expected output:
chr1 10 T A 2 1 0 1 1 2
chr1 10 T A 2 1 1 2 0 2
chr1 10 T G 3 2 1 2 0 0
I have done it with the following code:
awk 'BEGIN{FS=OFS="\t"}
{
j=split($5,a,",");split($6,b,",");
split($7,c,",");split($8,d,",");split($9,e,",");
for(i=1;i<=j;++i)
{
$5=a[i];$6=b[i];$7=c[i];$8=d[i];$9=e[i];print
}}'
But, as I have said before, there are 80 columns (or more) with comma-separated values in my real data.
Is there a way to it using iteration?
Note: I need to do it in bash (not MySQL, SQL, python...)
This awk may do:
file:
chr1 10 T A 2,2 1,1 0,1 1,2 1,0 2
chr1 10 T G 3 2 1 2 0 0
chr1 10 T C 5 1,2,3 4,2,1 7 1,8,3 3
chr1 10 T D 1,2,3,5 4,2,1,8 1,8,3,2 3 5 7
Solution:
awk '{
n=0;
for(i=5;i<=NF;i++) {
t=split($i,a,",");if(t>n) n=t};
for(j=1;j<=n;j++) {
printf "%s\t%s\t%s\t%s",$1,$2,$3,$4;
for(i=5;i<=NF;i++) {
split($i,a,",");printf "\t%s",(a[j]?a[j]:a[1])
};
print ""
}
}' file
chr1 10 T A 2 1 0 1 1 2
chr1 10 T A 2 1 1 2 1 2
chr1 10 T G 3 2 1 2 0 0
chr1 10 T C 5 1 4 7 1 3
chr1 10 T C 5 2 2 7 8 3
chr1 10 T C 5 3 1 7 3 3
chr1 10 T D 1 4 1 3 5 7
chr1 10 T D 2 2 8 3 5 7
chr1 10 T D 3 1 3 3 5 7
chr1 10 T D 5 8 2 3 5 7
Your test input gives:
chr1 10 T A 2 1 0 1 1 2
chr1 10 T A 2 1 1 2 1 2
chr1 10 T G 3 2 1 2 0 0
It does not mater if comma separated values are consecutive, as long as you do not mix 2 or 3 comma on the same line.
Here is another awk. In contrast to the previous solutions where we split fields into arrays, we attack the problem differently using substitutions. There is no field iterating going on:
awk '
BEGIN { OFS="\t" }
{ $1=$1;t=$0; }
{ while(index($0,",")) {
gsub(/,[[:alnum:],]*/,""); print;
$0=t; gsub(OFS "[[:alnum:]]*,",OFS); t=$0;
}
print t
}' file
how does it work:
The idea is based on 2 types of substitutions:
gsub(/,[[:alnum:],]*/,""): this removes all substrings made from alphanumeric characters and commas that start with a comma: 1,2,3,4 -> 1. This does not change fields that have no comma.
gsub(OFS "[[:alnum:]]*,",OFS): this removes alphanumeric characters followed by a single comma and are in the beginning of the field: 1,2,3,4 -> 2,3,4
So using these two substitutions, we iterate until no comma is left. See How can you tell which characters are in which character classes? on details for [[:alnum:]]
input:
chr1 10 T A 2,2 1,1 0,1 1,2 1,0 2
chr1 10 T G 3 2 1 2 0 0
chr1 10 T C 5 1,2,3 4,2,1 7 1,8,3 3
chr1 10 T D 1,2,3,5 4,2,1,8 1,8,3,2 3 5 7
output:
chr1 10 T A 2 1 0 1 1 2
chr1 10 T A 2 1 1 2 0 2
chr1 10 T G 3 2 1 2 0 0
chr1 10 T C 5 1 4 7 1 3
chr1 10 T C 5 2 2 7 8 3
chr1 10 T C 5 3 1 7 3 3
chr1 10 T D 1 4 1 3 5 7
chr1 10 T D 2 2 8 3 5 7
chr1 10 T D 3 1 3 3 5 7
chr1 10 T D 5 8 2 3 5 7
I have 4 column data files which have approximately 100 lines. I'd like to substract every nth from (n+3)th line and print the values in a new column ($5). The column data has not a regular pattern for each column.
My sample file:
cat input
1 2 3 20
1 2 3 10
1 2 3 5
1 2 3 20
1 2 3 30
1 2 3 40
1 2 3 .
1 2 3 .
1 2 3 . (and so on)
Output should be:
1 2 3 20 0 #(20-20)
1 2 3 10 20 #(30-10)
1 2 3 5 35 #(40-5)
1 2 3 20 ? #(. - 20)
1 2 3 30 ? #(. - 30)
1 2 3 40 ? #(. - 40)
1 2 3 .
1 2 3 .
1 2 3 . (and so on)
How can i do this in awk?
Thank you
For this I think the easiest thing is to read through the file twice. The first time (the NR==FNR block) we save all the 4th column values in an array indexed by the line number. The next block is executed for the second pass and creates a 5th column with the desired calculation (checking first to make sure that we wouldn't go passed the end of the file).
$ cat input
1 2 3 20
1 2 3 10
1 2 3 5
1 2 3 20
1 2 3 30
1 2 3 40
$ awk 'NR==FNR{a[NR]=$4; last=NR; next} {$5 = (FNR+3 <= last ? a[FNR+3] - $4 : "")}1' input input
1 2 3 20 0
1 2 3 10 20
1 2 3 5 35
1 2 3 20
1 2 3 30
1 2 3 40
You can do this using tac + awk + tac:
tac input |
awk '{a[NR]=$4} NR>3 { $5 = (a[NR-3] ~ /^[0-9]+$/ ? a[NR-3] - $4 : "?") } 1' |
tac | column -t
1 2 3 20 0
1 2 3 10 20
1 2 3 5 35
1 2 3 20 ?
1 2 3 30 ?
1 2 3 40 ?
1 2 3 .
1 2 3 .
1 2 3 .
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 3 years ago.
Improve this question
I have a file containing lines of the form
this is block 1
a 1 2 3
this is block 2
a 3 1 9
this is block 3
a 10 2 32
...
I would like to copy some of the lines and replace the beginning so that it becomes:
this is block 1
a 1 2 3
b 1 2 3
c 1 2 3
this is block 2
a 3 1 9
b 3 1 9
c 3 1 9
this is block 3
a 10 2 32
b 10 2 32
c 10 2 32
...
Not sure how to solve this with awk, sed or another elegant option.
Using sed or awk you can do it in this way:
$ cat data
this is block 1
a 1 2 3
this is block 2
a 3 1 9
this is block 3
a 10 2 32
$ sed -rn '/^a/{p;s/^a/b/;p;s/^b/c/;p;n};p' data
this is block 1
a 1 2 3
b 1 2 3
c 1 2 3
this is block 2
a 3 1 9
b 3 1 9
c 3 1 9
this is block 3
a 10 2 32
b 10 2 32
c 10 2 32
$ awk '{if($1=="a") {print;$1="b";print;$1="c";print} else {print}}' data
this is block 1
a 1 2 3
b 1 2 3
c 1 2 3
this is block 2
a 3 1 9
b 3 1 9
c 3 1 9
this is block 3
a 10 2 32
b 10 2 32
c 10 2 32
$
I have a data file of:
1 2 3
1 5 7
2 5 9
11 21 110
6 17 -2
10 2 8
6 4 3
5 1 8
6 1 5
7 3 1
I want to add number 1 to the third column, only for line number 1, 3, 6, 8, 9, 10. And add 2 to the second column, for line number 6~9.
I know how to add 2 to entire second column, and add 1 to entire third column using awk
awk '{print $1, $2+2, $3+1}' data > data2
But how can I modify this code to specific lines of second and third column?
Thanks
Best,
awk to the rescue! You can check for NR in the condition, but for 6 values it will be tedious, alternatively you can check for string match with anchored NR.
$ awk 'BEGIN{lines=",1,3,6,8,9,10,"}
match(lines,","NR","){$3++}
NR>=6 && NR<=9{$2+=2}1' nums
1 2 4
1 5 7
2 5 10
11 21 110
6 17 -2
10 4 9
6 6 3
5 3 9
6 3 6
7 3 2
$ cat tst.awk
BEGIN {
for (i=6;i<=9;i++) {
d[2,i] = 2
}
split("1 3 6 8 9 10",t);
for (i in t) {
d[3,t[i]] = 1
}
}
{ $2 += d[2,NR]; $3 += d[3,NR]; print }
$ awk -f tst.awk file
1 2 4
1 5 7
2 5 10
11 21 110
6 17 -2
10 4 9
6 6 3
5 3 9
6 3 6
7 3 2
Input file:
AAA 2 3 4 5
BBB 3 4 5
AAA 23 21 34
BBB 4 5 62
I want the output to be:
AAA 2 3 4 5 23 21 34
BBB 3 4 5 4 5 62
I feel that I should use awk and sed but not sure how to realize it. Does anyone have any good ideas? Thanks.
This might work for you:
sort -sk1,1 file | sed ':a;$!N;s/^\([^ ]* \)\(.*\)\n\1/\1\2/;ta;P;D'
AAA 2 3 4 5 23 21 34
BBB 3 4 5 4 5 62
or gnu awk;
awk '{if($1 in a){line=$0;sub(/[^ ]* /,"",line);a[$1]=a[$1]line;next};a[$1]=$0}END{n=asort(a);for(i=1;i<=n;i++)print a[i]}' file
AAA 2 3 4 5 23 21 34
BBB 3 4 5 4 5 62
Here is an awk 1 liner to solve above problem:
awk '{line=$2;for(i=3; i<=NF; i++) line=line " " $i; arr[$1]=arr[$1] " " line} END{for (val in arr) print val, arr[val]}' file
Using bash version 4's associative arrays
$ declare -A vals
$ while read key nums; do vals[$key]+="$nums "; done < filename
$ for key in "${!vals[#]}"; do printf "%s %s\n" "$key" "${vals[$key]}"; done
AAA 2 3 4 5 23 21 34
BBB 3 4 5 4 5 62