I'm trying to split multiple comma-separated values into rows.
I have achieved it for a small number of columns with comma-separated values (with awk), but in my real table I have to do it for 80 columns, so I'm looking for a way to iterate.
An example of input that I need to split:
CHROM POS REF ALT GT_00 C_00 D_00 E_00 F_00 GT_11
chr1 10 T A 2,2 1,1 0,1 1,2 1,0 2
chr1 10 T G 3 2 1 2 0 0
The expected output:
chr1 10 T A 2 1 0 1 1 2
chr1 10 T A 2 1 1 2 0 2
chr1 10 T G 3 2 1 2 0 0
I have done it with the following code:
awk 'BEGIN{FS=OFS="\t"}
{
j=split($5,a,",");split($6,b,",");
split($7,c,",");split($8,d,",");split($9,e,",");
for(i=1;i<=j;++i)
{
$5=a[i];$6=b[i];$7=c[i];$8=d[i];$9=e[i];print
}}'
But, as I have said before, there are 80 columns (or more) with comma-separated values in my real data.
Is there a way to do it using iteration?
Note: I need to do it in bash (not MySQL, SQL, python...)
This awk may do:
file:
chr1 10 T A 2,2 1,1 0,1 1,2 1,0 2
chr1 10 T G 3 2 1 2 0 0
chr1 10 T C 5 1,2,3 4,2,1 7 1,8,3 3
chr1 10 T D 1,2,3,5 4,2,1,8 1,8,3,2 3 5 7
Solution:
awk '{
  # find the largest number of comma-separated values on this line
  n=0;
  for(i=5;i<=NF;i++) {
    t=split($i,a,","); if(t>n) n=t
  }
  # emit one output row per value index
  for(j=1;j<=n;j++) {
    printf "%s\t%s\t%s\t%s",$1,$2,$3,$4;
    for(i=5;i<=NF;i++) {
      # test j<=t rather than the value itself: 0 is falsy in awk,
      # so a[j]?a[j]:a[1] would wrongly replace a 0 with the first value
      t=split($i,a,","); printf "\t%s",(j<=t?a[j]:a[1])
    }
    print ""
  }
}' file
chr1 10 T A 2 1 0 1 1 2
chr1 10 T A 2 1 1 2 0 2
chr1 10 T G 3 2 1 2 0 0
chr1 10 T C 5 1 4 7 1 3
chr1 10 T C 5 2 2 7 8 3
chr1 10 T C 5 3 1 7 3 3
chr1 10 T D 1 4 1 3 5 7
chr1 10 T D 2 2 8 3 5 7
chr1 10 T D 3 1 3 3 5 7
chr1 10 T D 5 8 2 3 5 7
Your test input gives:
chr1 10 T A 2 1 0 1 1 2
chr1 10 T A 2 1 1 2 0 2
chr1 10 T G 3 2 1 2 0 0
It does not matter whether the comma-separated fields are consecutive, as long as you do not mix lists of different lengths (say, 2 and 3 values) on the same line; a shorter field is padded with its first value.
Here is another awk. In contrast to the previous solutions, which split the fields into arrays, this one attacks the problem with substitutions; there is no field iteration going on:
awk '
BEGIN { OFS="\t" }
{ $1=$1; t=$0 }                                 # rebuild the record with tab separators, keep a copy in t
{ while(index($0,",")) {                        # as long as a comma remains
    gsub(/,[[:alnum:],]*/,""); print            # keep only the first value of every list and print the row
    $0=t; gsub(OFS "[[:alnum:]]*,",OFS); t=$0   # drop the first value from every list and save again
  }
  print t                                       # print the last (or only) row
}' file
How does it work?
The idea is based on two types of substitutions:
gsub(/,[[:alnum:],]*/,""): this removes, from each field, the substring of alphanumeric characters and commas that starts at the first comma: 1,2,3,4 -> 1. Fields without a comma are left unchanged.
gsub(OFS "[[:alnum:]]*,",OFS): this removes the alphanumeric run and the single comma at the beginning of each field: 1,2,3,4 -> 2,3,4
Using these two substitutions, we iterate until no comma is left. See How can you tell which characters are in which character classes? for details on [[:alnum:]]
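To see the two substitutions in isolation, here they are applied to a single made-up value (in this demo the second pattern is anchored with ^ as a stand-in for the OFS that precedes each field in the real script):
$ echo "1,2,3,4" | awk '{ gsub(/,[[:alnum:],]*/,""); print }'
1
$ echo "1,2,3,4" | awk '{ gsub(/^[[:alnum:]]*,/,""); print }'
2,3,4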
input:
chr1 10 T A 2,2 1,1 0,1 1,2 1,0 2
chr1 10 T G 3 2 1 2 0 0
chr1 10 T C 5 1,2,3 4,2,1 7 1,8,3 3
chr1 10 T D 1,2,3,5 4,2,1,8 1,8,3,2 3 5 7
output:
chr1 10 T A 2 1 0 1 1 2
chr1 10 T A 2 1 1 2 0 2
chr1 10 T G 3 2 1 2 0 0
chr1 10 T C 5 1 4 7 1 3
chr1 10 T C 5 2 2 7 8 3
chr1 10 T C 5 3 1 7 3 3
chr1 10 T D 1 4 1 3 5 7
chr1 10 T D 2 2 8 3 5 7
chr1 10 T D 3 1 3 3 5 7
chr1 10 T D 5 8 2 3 5 7
Thank you for your interest.
The original data:
land_cover_classes rows columns LandCoverDist
"1 of 18" "1 of 720" "1 of 1440" 20
"1 of 18" "1 of 720" "2 of 1440" 0
"1 of 18" "1 of 720" "3 of 1440" 0
"10 of 18" "1 of 720" "4 of 1440" 1
"9 of 18" "110 of 720" "500 of 1440" 0
"1 of 18" "1 of 720" "6 of 1440" 354
"1 of 18" "1 of 720" "7 of 1440" 0
"1 of 18" "1 of 720" "8 of 1440" 0
"1 of 18" "720 of 720" "1440 of 1440" 0
And the expected output should be:
land_cover_classes rows columns LandCoverDist
1 1 1 20
......
9 110 500 0
1 1 6 354
......
1 720 1440 0
$ awk -F'["[:space:]]+' 'NR>1{$0 = $2 OFS $5 OFS $8 OFS $11} 1' file
land_cover_classes rows columns LandCoverDist
1 1 1 20
1 1 2 0
1 1 3 0
10 1 4 1
9 110 500 0
1 1 6 354
1 1 7 0
1 1 8 0
1 720 1440 0
$ awk -F'["[:space:]]+' 'NR>1{$0 = $2 OFS $5 OFS $8 OFS $11} 1' file | column -t
land_cover_classes rows columns LandCoverDist
1 1 1 20
1 1 2 0
1 1 3 0
10 1 4 1
9 110 500 0
1 1 6 354
1 1 7 0
1 1 8 0
1 720 1440 0
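Why $2, $5, $8 and $11? The separator ["[:space:]]+ splits on every run of quotes and whitespace, so the leading quote yields an empty $1 and the wanted numbers land in every third field. A quick check on the first data row:
$ echo '"1 of 18" "1 of 720" "1 of 1440" 20' | awk -F'["[:space:]]+' '{ for(i=1;i<=NF;i++) print i, "[" $i "]" }'
1 []
2 [1]
3 [of]
4 [18]
5 [1]
6 [of]
7 [720]
8 [1]
9 [of]
10 [1440]
11 [20]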
Awk solution:
awk 'BEGIN{ FS="\"[[:space:]]+"; OFS="\t" }
function get_num(n){
gsub(/^"| of.*/,"",n);
return n
}
NR==1; NR>1{ print get_num($1), get_num($2), get_num($3), $4 }' file
The output:
land_cover_classes rows columns LandCoverDist
1 1 1 20
1 1 2 0
1 1 3 0
10 1 4 1
9 110 500 0
1 1 6 354
1 1 7 0
1 1 8 0
1 720 1440 0
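The work happens in get_num(): the regex ^"| of.* deletes both the leading quote and everything from " of" to the end of the field. Applied to a lone value taken from the data:
$ echo '"110 of 720"' | awk '{ gsub(/^"| of.*/,""); print }'
110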
$ awk '
BEGIN { FS="\" *\"?" }
NR==1 # print header
{
for(i=2;i<=NF;i++) { # starting from the second field
split($i,a," of ") # split at _of_
printf "%s%s", a[1], (i==NF?ORS:OFS) # print the first part and separator
}
}' file
land_cover_classes rows columns LandCoverDist
1 1 1 20
1 1 2 0
1 1 3 0
10 1 4 1
9 110 500 0
1 1 6 354
1 1 7 0
1 1 8 0
1 720 1440 0
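To see what the FS="\" *\"?" separator produces, print the fields of one data row; the leading quote creates an empty $1, which is why the loop starts at i=2:
$ echo '"1 of 18" "1 of 720" "1 of 1440" 20' | awk 'BEGIN{ FS="\" *\"?" } { for(i=1;i<=NF;i++) print i, "[" $i "]" }'
1 []
2 [1 of 18]
3 [1 of 720]
4 [1 of 1440]
5 [20]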
Good night. I have these two files:
File 1 - with phenotype information; the first column holds the IDs. The original file has 400 rows:
ID a b c d
215 2 25 13.8354303 15.2841303
222 2 25.2 15.8507278 17.2994278
216 2 28.2 13.0482192 14.4969192
223 11 15.4 9.2714745 11.6494745
File 2 - with SNP information; the original file has 400 lines and 42,000 characters per line.
ID t u j l
215 2 0 2 1
222 2 0 1 1
216 2 0 2 1
223 2 0 2 2
217 2 0 2 1
218 0 2 0 2
And I need to remove from file 2 the individuals that do not appear in file 1, for example:
ID t u j l
215 2 0 2 1
222 2 0 1 1
216 2 0 2 1
223 2 0 2 2
I used this code:
awk 'NR==FNR{a[$1]; next}$1 in a{print $0}' file1 file2 > file3
and I can get this output (file 3):
215 2 0 2 1
222 2 0 1 1
216 2 0 2 1
223 2 0 2 2
But I lose the header. How do I keep it?
To keep the header of the second file, add a condition{action} like this:
awk 'NR==FNR {a[$1]; next}
FNR==1 {print $0; next} # <= this will print the header of file2.
$1 in a {print $0}' file1 file2
NR holds the total record number, while FNR is the per-file record number: it counts the records of the file currently being processed. The next statements are also important, so that awk continues with the next record instead of trying the remaining actions on it.
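A quick way to see the difference is to print both counters while awk reads the two sample files:
$ awk '{ print FILENAME, "NR=" NR, "FNR=" FNR }' file1 file2
file1 NR=1 FNR=1
...
file1 NR=5 FNR=5
file2 NR=6 FNR=1
file2 NR=7 FNR=2
...
So NR==FNR is only true while the first file is being read.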
I have a PLINK ped file that looks like this:
ACS_D132 ACS_D132 0 0 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
ACS_D140 ACS_D140 0 0 2 2 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1
ACS_D141 ACS_D141 0 0 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1
ACS_D147 ACS_D147 0 0 2 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1
ACS_D155 ACS_D155 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
ACS_D196 ACS_D196 0 0 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
ACS_D221 ACS_D221 0 0 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
I am interested in counting how many times the string "2" occurs between the 7th field (included) and the last field. Then, if the number of occurrences is:
0: add 1 (being absent) to the new last field
1: add 2 (being present) to the new last field
2: add 2 (being present) to the new last field
The output would be:
ACS_D132 ACS_D132 0 0 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
ACS_D140 ACS_D140 0 0 2 2 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 2
ACS_D141 ACS_D141 0 0 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2
ACS_D147 ACS_D147 0 0 2 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2
ACS_D155 ACS_D155 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
ACS_D196 ACS_D196 0 0 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
ACS_D221 ACS_D221 0 0 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
I know that to count the occurrences of a string in every field I can use:
grep -n -o "2" file1 | sort -n | uniq -c | cut -d : -f 1
And that I can merge the 2 results using:
paste -d' ' file1 file2 > file3
But I don't know how to count the occurrences between two fields.
Thank you in advance for helping me!
You can use awk to iterate over a range of fields in each row:
awk '{c=0; for(i=7; i<=NF; i++) if ($i==2) c++; c=(c?2:1); print $0, c}' file
ACS_D132 ACS_D132 0 0 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
ACS_D140 ACS_D140 0 0 2 2 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 2
ACS_D141 ACS_D141 0 0 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2
ACS_D147 ACS_D147 0 0 2 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2
ACS_D155 ACS_D155 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
ACS_D196 ACS_D196 0 0 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
ACS_D221 ACS_D221 0 0 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Perl to the rescue:
perl -ape 's/$/" " . (1 + !! grep 2 == $_, @F[6 .. $#F])/e'
-p reads the input line by line and prints the result
-a splits each input line on whitespace into the @F array
grep in scalar context returns the count; !! (double negation) turns it into false or 1, and adding 1 yields 1 or 2 as requested
s/// substitutes $ (end of line) with the result of the code in the replacement part (that's what /e does)
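Two tiny standalone demos of those pieces, with made-up numbers:
$ perl -e 'print scalar(grep 2 == $_, 1, 2, 0, 2), "\n"'
2
$ perl -e 'print 1 + !!2, " ", 1 + !!0, "\n"'
2 1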
You could use awk:
awk '{s=0;for(i=7;i<=NF;i++) if($i==2) s+=1; s=s==0?1:2; print $0, s;}' data.txt
Explanations:
The instructions between the {} are executed on each line of the file.
NF is the number of fields in the line. They are numbered 1 to NF and you can access them with the $n notation.
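For instance, on a throwaway line:
$ echo 'a b c d' | awk '{ print NF, $1, $NF }'
4 a d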
I need to insert the zero elements in any sparse matrix in the Matrix Market format (but already without the headers).
The first column is the number of the ROW, the second columns is the number of the COLUMN and the third column is the VALUE of the element.
I'm using a 2 x 3 matrix for testing, but I need to be able to do this for a matrix of any dimensions, m x n.
The number of rows, columns and non-zero elements of each matrix are already in separate variables.
I've used bash, sed and awk to work with these matrices until now.
Input file:
1 1 1.0000
1 2 2.0000
2 1 4.0000
2 2 5.0000
2 3 6.0000
ROWS and COLUMNS are integer %d and VALUES are float %.4f
Here only one element is zero (row 1, column 3); the line that represents it is omitted.
So, how can I insert this line?
Output file:
1 1 1.0000
1 2 2.0000
1 3 0.0000
2 1 4.0000
2 2 5.0000
2 3 6.0000
An empty 2 x 3 matrix would be like this:
1 1 0.0000
1 2 0.0000
1 3 0.0000
2 1 0.0000
2 2 0.0000
2 3 0.0000
Another example, a 3 x 4 matrix with more zero elements.
Input file:
1 2 9.7856
1 4 4.2311
2 1 3.4578
2 2 45.1231
2 3 -12.0124
3 4 0.1245
Output file:
1 1 0.0000
1 2 9.7856
1 3 0.0000
1 4 4.2311
2 1 3.4578
2 2 45.1231
2 3 -12.0124
2 4 0.0000
3 1 0.0000
3 2 0.0000
3 3 0.0000
3 4 0.1245
I hope you can help me. I have already spent more than 3 days trying to find a solution.
The best I got was this:
for((i=1;i<3;i++))
do
for((j=1;j<4;j++))
do
awk -v I=${i} -v J=${j} 'BEGIN{FS=" "}
{if($1==I && $2==J)
printf("%d %d %.4f\n",I,J,$3)
else
printf("%d %d %d\n",I,J,0)
}' ./etc/A.2
done
done
But it's not efficient and prints lots of undesired lines:
1 1 1.0000
1 1 0
1 1 0
1 1 0
1 1 0
1 2 0
1 2 2.0000
1 2 0
1 2 0
1 2 0
1 3 0
1 3 0
1 3 0
1 3 0
1 3 0
2 1 0
2 1 0
2 1 4.0000
2 1 0
2 1 0
2 2 0
2 2 0
2 2 0
2 2 5.0000
2 2 0
2 3 0
2 3 0
2 3 0
2 3 0
2 3 6.0000
Please! Help me! Thank you all!
If you want to specify the max "I" and "J" values:
$ cat tst.awk
{ a[$1,$2] = $3 }
END {
for (i=1;i<=I;i++)
for (j=1;j<=J;j++)
print i, j, ( (i,j) in a ? a[i,j] : "0.0000" )
}
$ awk -v I=2 -v J=3 -f tst.awk file
1 1 1.0000
1 2 2.0000
1 3 0.0000
2 1 4.0000
2 2 5.0000
2 3 6.0000
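Since you said the dimensions are already in shell variables, you can pass those straight in (the variable names here are assumptions):
$ awk -v I="$n_rows" -v J="$n_cols" -f tst.awk file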
If you'd rather have the tool figure the dimensions out itself (this won't work for an empty file, or if the maximum desired indices never appear in the data):
$ cat tst2.awk
NR==1 { I=$1; J=$2 }
{
a[$1,$2] = $3
I = (I > $1 ? I : $1)
J = (J > $2 ? J : $2)
}
END {
for (i=1;i<=I;i++)
for (j=1;j<=J;j++)
print i, j, ( (i,j) in a ? a[i,j] : "0.0000" )
}
$ awk -f tst2.awk file
1 1 1.0000
1 2 2.0000
1 3 0.0000
2 1 4.0000
2 2 5.0000
2 3 6.0000
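In both scripts, a[$1,$2] keys the array by the row/column pair (awk joins the two indices with SUBSEP), and (i,j) in a tests whether that pair exists without creating it. A minimal illustration:
$ awk 'BEGIN{ a[1,3]="x"; x=((1,3) in a); y=((3,1) in a); print x, y }'
1 0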