How can I split comma-separated values into multiple rows? - bash

I'm trying to split multiple comma-separated values into rows.
I have managed this for a small number of columns with comma-separated values (using awk), but in my real table I have to do it across 80 columns. So, I'm looking for a way to iterate.
An example of input that I need to split:
CHROM POS REF ALT GT_00 C_00 D_00 E_00 F_00 GT_11
chr1 10 T A 2,2 1,1 0,1 1,2 1,0 2
chr1 10 T G 3 2 1 2 0 0
The expected output:
chr1 10 T A 2 1 0 1 1 2
chr1 10 T A 2 1 1 2 0 2
chr1 10 T G 3 2 1 2 0 0
I have done it with the following code:
awk 'BEGIN{FS=OFS="\t"}
{
  j=split($5,a,","); split($6,b,",")
  split($7,c,","); split($8,d,","); split($9,e,",")
  for(i=1;i<=j;++i) {
    $5=a[i]; $6=b[i]; $7=c[i]; $8=d[i]; $9=e[i]
    print
  }
}'
But, as I have said before, there are 80 columns (or more) with comma-separated values in my real data.
Is there a way to do it using iteration?
Note: I need to do it in bash (not MySQL, SQL, python...)

This awk may do:
file:
chr1 10 T A 2,2 1,1 0,1 1,2 1,0 2
chr1 10 T G 3 2 1 2 0 0
chr1 10 T C 5 1,2,3 4,2,1 7 1,8,3 3
chr1 10 T D 1,2,3,5 4,2,1,8 1,8,3,2 3 5 7
Solution:
awk '{
  n=0
  for(i=5;i<=NF;i++) {                  # find the longest comma list on this line
    t=split($i,a,","); if(t>n) n=t
  }
  for(j=1;j<=n;j++) {                   # emit one output row per list element
    printf "%s\t%s\t%s\t%s",$1,$2,$3,$4
    for(i=5;i<=NF;i++) {
      t=split($i,a,",")
      printf "\t%s",(j<=t ? a[j] : a[1])   # shorter fields repeat their first value
    }
    print ""
  }
}' file
chr1 10 T A 2 1 0 1 1 2
chr1 10 T A 2 1 1 2 0 2
chr1 10 T G 3 2 1 2 0 0
chr1 10 T C 5 1 4 7 1 3
chr1 10 T C 5 2 2 7 8 3
chr1 10 T C 5 3 1 7 3 3
chr1 10 T D 1 4 1 3 5 7
chr1 10 T D 2 2 8 3 5 7
chr1 10 T D 3 1 3 3 5 7
chr1 10 T D 5 8 2 3 5 7
Your test input gives:
chr1 10 T A 2 1 0 1 1 2
chr1 10 T A 2 1 1 2 0 2
chr1 10 T G 3 2 1 2 0 0
It does not matter whether the fields with comma-separated values are consecutive. Just be aware that when fields on the same line have different numbers of values (say two and three), the shorter ones are padded with their first value.
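For example, a line that mixes a two-value and a three-value field (a made-up test line, not from the question) expands like this with the script above; note how the shorter field repeats its first value on the extra row:
input:
chr1 10 T X 1,2 7,8,9
output:
chr1 10 T X 1 7
chr1 10 T X 2 8
chr1 10 T X 1 9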

Here is another awk. In contrast to the previous solutions, where we split fields into arrays, this one attacks the problem with substitutions; there is no field iteration going on:
awk '
BEGIN { OFS="\t" }
{ $1=$1; t=$0 }
{
  while (index($0, ",")) {
    gsub(/,[[:alnum:],]*/, ""); print
    $0=t; gsub(OFS "[[:alnum:]]*,", OFS); t=$0
  }
  print t
}' file
How it works:
($1=$1 first rebuilds the record with OFS between the fields, so the second pattern below can anchor on the tab.) The idea is based on two types of substitutions:
gsub(/,[[:alnum:],]*/,""): this deletes, in every field, the substring starting at the first comma (a comma followed by any run of alphanumerics and commas): 1,2,3,4 -> 1. Fields without a comma are untouched.
gsub(OFS "[[:alnum:]]*,",OFS): this deletes the leading alphanumeric run together with the comma that follows it at the start of every field: 1,2,3,4 -> 2,3,4
Using these two substitutions, we iterate until no comma is left. See How can you tell which characters are in which character classes? for details on [[:alnum:]].
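You can watch each substitution in isolation on a single field (a throwaway test; the second pattern is anchored at ^ here because we feed it one field rather than a whole record, where the answer anchors on OFS instead):
$ echo '1,2,3,4' | awk '{ gsub(/,[[:alnum:],]*/, ""); print }'
1
$ echo '1,2,3,4' | awk '{ gsub(/^[[:alnum:]]*,/, ""); print }'
2,3,4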
input:
chr1 10 T A 2,2 1,1 0,1 1,2 1,0 2
chr1 10 T G 3 2 1 2 0 0
chr1 10 T C 5 1,2,3 4,2,1 7 1,8,3 3
chr1 10 T D 1,2,3,5 4,2,1,8 1,8,3,2 3 5 7
output:
chr1 10 T A 2 1 0 1 1 2
chr1 10 T A 2 1 1 2 0 2
chr1 10 T G 3 2 1 2 0 0
chr1 10 T C 5 1 4 7 1 3
chr1 10 T C 5 2 2 7 8 3
chr1 10 T C 5 3 1 7 3 3
chr1 10 T D 1 4 1 3 5 7
chr1 10 T D 2 2 8 3 5 7
chr1 10 T D 3 1 3 3 5 7
chr1 10 T D 5 8 2 3 5 7

Related

How to subtract every nth from (n+3)th line in awk?

I have 4-column data files with approximately 100 lines. I'd like to subtract every nth line from the (n+3)th line and print the result in a new column ($5). The column data does not follow a regular pattern.
My sample file:
cat input
1 2 3 20
1 2 3 10
1 2 3 5
1 2 3 20
1 2 3 30
1 2 3 40
1 2 3 .
1 2 3 .
1 2 3 . (and so on)
Output should be:
1 2 3 20 0 #(20-20)
1 2 3 10 20 #(30-10)
1 2 3 5 35 #(40-5)
1 2 3 20 ? #(. - 20)
1 2 3 30 ? #(. - 30)
1 2 3 40 ? #(. - 40)
1 2 3 .
1 2 3 .
1 2 3 . (and so on)
How can I do this in awk?
Thank you
For this I think the easiest thing is to read through the file twice. The first time (the NR==FNR block) we save all the 4th-column values in an array indexed by line number. The next block is executed on the second pass and creates a 5th column with the desired calculation (checking first to make sure we don't go past the end of the file).
$ cat input
1 2 3 20
1 2 3 10
1 2 3 5
1 2 3 20
1 2 3 30
1 2 3 40
$ awk 'NR==FNR{a[NR]=$4; last=NR; next} {$5 = (FNR+3 <= last ? a[FNR+3] - $4 : "")}1' input input
1 2 3 20 0
1 2 3 10 20
1 2 3 5 35
1 2 3 20
1 2 3 30
1 2 3 40
You can do this using tac + awk + tac:
tac input |
awk '{a[NR]=$4} NR>3 { $5 = (a[NR-3] ~ /^[0-9]+$/ ? a[NR-3] - $4 : "?") } 1' |
tac | column -t
1 2 3 20 0
1 2 3 10 20
1 2 3 5 35
1 2 3 20 ?
1 2 3 30 ?
1 2 3 40 ?
1 2 3 .
1 2 3 .
1 2 3 .
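If you want to avoid both the double read and the tac round trip, a single pass with lookahead buffering also works. This is my own sketch, not from the answers above; it stores every line, which is fine for files of ~100 lines:
awk '{
  line[NR] = $0; val[NR] = $4                    # remember each line and its 4th column
  if (NR > 3) {
    d = (val[NR] ~ /^[0-9]+$/ ? val[NR] - val[NR-3] : "?")
    print line[NR-3], d                          # emit the line three back with its diff
  }
}
END { for (i = (NR > 3 ? NR - 2 : 1); i <= NR; i++) print line[i] }' input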

AWK: Add number to the column for specific line

I have a data file of:
1 2 3
1 5 7
2 5 9
11 21 110
6 17 -2
10 2 8
6 4 3
5 1 8
6 1 5
7 3 1
I want to add 1 to the third column, but only for lines 1, 3, 6, 8, 9 and 10, and add 2 to the second column for lines 6 through 9.
I know how to add 2 to the entire second column and add 1 to the entire third column using awk:
awk '{print $1, $2+2, $3+1}' data > data2
But how can I modify this code to target specific lines of the second and third columns?
Thanks
Best,
awk to the rescue! You can test NR in the condition, but for 6 values that gets tedious; alternatively, you can do a string match against NR anchored between commas (the commas keep, say, NR=1 from matching the 1 inside 10).
$ awk 'BEGIN{lines=",1,3,6,8,9,10,"}
match(lines,","NR","){$3++}
NR>=6 && NR<=9{$2+=2}1' nums
1 2 4
1 5 7
2 5 10
11 21 110
6 17 -2
10 4 9
6 6 3
5 3 9
6 3 6
7 3 2
An alternative is to build the additions into a delta table keyed by (column, line number) up front, then add unconditionally on every line:
$ cat tst.awk
BEGIN {
  for (i=6;i<=9;i++) {
    d[2,i] = 2                  # +2 on column 2 for lines 6-9
  }
  split("1 3 6 8 9 10",t)
  for (i in t) {
    d[3,t[i]] = 1               # +1 on column 3 for the listed lines
  }
}
{ $2 += d[2,NR]; $3 += d[3,NR]; print }
$ awk -f tst.awk file
1 2 4
1 5 7
2 5 10
11 21 110
6 17 -2
10 4 9
6 6 3
5 3 9
6 3 6
7 3 2
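If the line numbers come from somewhere else, the same delta-table idea can be parameterized from the shell; a sketch (the add2/add3 variable names are my own invention):
awk -v add3="1 3 6 8 9 10" -v add2="6 7 8 9" '
BEGIN {
  split(add3, t); for (i in t) d3[t[i]] = 1      # lines whose 3rd column gets +1
  split(add2, t); for (i in t) d2[t[i]] = 2      # lines whose 2nd column gets +2
}
{ $2 += d2[NR]; $3 += d3[NR]; print }' file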

Matrix addition over multiple files using e.g. awk

I have a bunch of files from simulation output, all with the same number of rows and fields.
What I need to do is combine them so that I get a single file with the numbers summed up, which is basically the addition of several matrices.
Example:
File1.txt
1 1 1
1 1 1
1 1 1
File2.txt
2 2 2
2 2 2
2 2 2
File3.txt
3 3 3
3 3 3
3 3 3
required output
6 6 6
6 6 6
6 6 6
I'm going to integrate this into a larger shell script, therefore I would prefer a solution in awk, though other languages are welcome as well.
awk '{for(i=1;i<=NF;i++) a[FNR,i]+=$i}           # accumulate each cell across all files
     END{for(i=1;i<=FNR;i++)
           for(j=1;j<=NF;j++)
             printf "%s%s", a[i,j], (j==NF?"\n":FS)}' f1 f2 f3
There can be more than 3 input files.
test with your data:
kent$ head f[1-3]
==> f1 <==
1 1 1
1 1 1
1 1 1
==> f2 <==
2 2 2
2 2 2
2 2 2
==> f3 <==
3 3 3
3 3 3
3 3 3
kent$ awk '{for(i=1;i<=NF;i++)a[FNR,i]=$i+a[FNR,i]}END{for(i=1;i<=FNR;i++)for(j=1;j<=NF;j++)printf "%s%s", a[i,j],(j==NF?"\n":FS)}' f1 f2 f3
6 6 6
6 6 6
6 6 6
Quick hack (m is the number of columns per file):
paste f1 f2 f3 | awk '{for(i=1;i<=m;i++)printf "%d%s",$i+$(i+m)+$(i+2*m),i==m?ORS:OFS}' m=3
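The same hack generalized to any number of files, with the file count taken from the shell (a sketch; the files array name is an assumption):
files=(f1 f2 f3)
paste "${files[@]}" | awk -v m=3 -v k="${#files[@]}" '{
  for (i = 1; i <= m; i++) {
    s = 0
    for (f = 0; f < k; f++) s += $(i + f*m)      # the same cell in each pasted block
    printf "%d%s", s, (i == m ? ORS : OFS)
  }
}'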

How to insert zero elements in sparse matrix using shell, bash, awk, sed

I need to insert the zero elements into any sparse matrix in Matrix Market format (already without the headers).
The first column is the ROW number, the second column is the COLUMN number, and the third column is the VALUE of the element.
I'm using a 2 x 3 matrix for testing, but I need to be able to do this for a matrix of any dimension, m x n.
The number of rows, columns and non-zero elements of each matrix are already in separate variables.
I've used bash, sed and awk to work with these matrices until now.
Input file:
1 1 1.0000
1 2 2.0000
2 1 4.0000
2 2 5.0000
2 3 6.0000
ROWS and COLUMNS are integers (%d) and VALUES are floats (%.4f).
Here only one element is zero (row 1, column 3); the line that represents it is omitted.
So, how can I insert this line?
Output file:
1 1 1.0000
1 2 2.0000
1 3 0.0000
2 1 4.0000
2 2 5.0000
2 3 6.0000
An empty 2 x 3 matrix would be like this:
1 1 0.0000
1 2 0.0000
1 3 0.0000
2 1 0.0000
2 2 0.0000
2 3 0.0000
Another example, a 3 x 4 matrix with more zero elements.
Input file:
1 2 9.7856
1 4 4.2311
2 1 3.4578
2 2 45.1231
2 3 -12.0124
3 4 0.1245
Output file:
1 1 0.0000
1 2 9.7856
1 3 0.0000
1 4 4.2311
2 1 3.4578
2 2 45.1231
2 3 -12.0124
2 4 0.0000
3 1 0.0000
3 2 0.0000
3 3 0.0000
3 4 0.1245
I hope you can help me. I have already spent more than 3 days trying to find a solution.
The best I got was this:
for ((i=1;i<3;i++)); do
  for ((j=1;j<4;j++)); do
    awk -v I=${i} -v J=${j} 'BEGIN{FS=" "}
    {
      if ($1==I && $2==J)
        printf("%d %d %.4f\n",I,J,$3)
      else
        printf("%d %d %d\n",I,J,0)
    }' ./etc/A.2
  done
done
But it's not efficient and prints lots of undesired lines:
1 1 1.0000
1 1 0
1 1 0
1 1 0
1 1 0
1 2 0
1 2 2.0000
1 2 0
1 2 0
1 2 0
1 3 0
1 3 0
1 3 0
1 3 0
1 3 0
2 1 0
2 1 0
2 1 4.0000
2 1 0
2 1 0
2 2 0
2 2 0
2 2 0
2 2 5.0000
2 2 0
2 3 0
2 3 0
2 3 0
2 3 0
2 3 6.0000
Please! Help me! Thank you all!
If you want to specify the max "I" and "J" values:
$ cat tst.awk
{ a[$1,$2] = $3 }
END {
  for (i=1;i<=I;i++)
    for (j=1;j<=J;j++)
      print i, j, ( (i,j) in a ? a[i,j] : "0.0000" )
}
$ awk -v I=2 -v J=3 -f tst.awk file
1 1 1.0000
1 2 2.0000
1 3 0.0000
2 1 4.0000
2 2 5.0000
2 3 6.0000
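Since the dimensions already live in shell variables (as the question says), the same script can be driven from them directly; ROWS and COLS are placeholder names for whatever your variables are called:
$ ROWS=2 COLS=3
$ awk -v I="$ROWS" -v J="$COLS" -f tst.awk file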
If you'd rather the tool just figure it out (this won't work for an empty file, or if the maximum desired values are never actually populated):
$ cat tst2.awk
NR==1 { I=$1; J=$2 }
{
  a[$1,$2] = $3
  I = (I > $1 ? I : $1)
  J = (J > $2 ? J : $2)
}
END {
  for (i=1;i<=I;i++)
    for (j=1;j<=J;j++)
      print i, j, ( (i,j) in a ? a[i,j] : "0.0000" )
}
$ awk -f tst2.awk file
1 1 1.0000
1 2 2.0000
1 3 0.0000
2 1 4.0000
2 2 5.0000
2 3 6.0000

Records missing in o/p when setting numReduceTasks=0 in Hadoop Streaming

As mentioned in the title, records go missing from the output when I set numReduceTasks=0. Can you please suggest what the problem could be?
Command
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar \
-input /usr/pkansal/ex2/output \
-output /usr/pkansal/ex2/output2 \
-mapper /home/cloudera/ex2/kMerFreqMap2.py \
-file /home/cloudera/ex2/kMerFreqMap2.py \
-numReduceTasks 0
(If I comment this line out, things go fine.)
I/P
3 chr1:1,chr1:3,chr1:5
1 chr1:7
2 chr1:2,chr1:4
1 chr1:6
Expected O/P
chr1 1 3
chr1 3 3
chr1 5 3
chr1 7 1
chr1 2 2
chr1 4 2
chr1 6 1
Actual O/P
chr1 2 2
chr1 4 2
chr1 6 1
