I am preparing the background GMT file for gProfiler (custom, non-model plant), but what I have lists the gene ID in the first column, and I need the mapping reversed: one GO term per line, followed by the genes annotated to it. Is there a way in Linux or R to rearrange these data?
**current**
Gene1 GO:A GO:B GO:C GO:D GO:E
Gene2 GO:E GO:F GO:G GO:H GO:I GO:J
Gene3 GO:A GO:D GO:E GO:H GO:J
Gene4 GO:B GO:C GO:E GO:F GO:J
Gene5 GO:C GO:J
**new**
GO:A Gene1 Gene3
GO:B Gene1 Gene4
GO:C Gene1 Gene4 Gene5
GO:D Gene1 Gene3
GO:E Gene1 Gene2 Gene3 Gene4
GO:F Gene2 Gene4
GO:G Gene2
GO:H Gene2 Gene3
GO:I Gene2
GO:J Gene2 Gene3 Gene4 Gene5
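A minimal awk sketch of one way to do this inversion, assuming whitespace-separated input in a file I will call mapping.txt (a placeholder name): collect the genes under each GO term, then print one term per line.

awk '{ for (i = 2; i <= NF; i++)          # walk the GO terms on this gene's line
         genes[$i] = genes[$i] " " $1 }   # append the gene under each term
     END { for (go in genes) print go genes[go] }' mapping.txt | sort

The traversal order of an awk array is unspecified, hence the final sort.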
I would like to compare two files to identify common translocations. However, these translocations do not have exactly the same coordinates in the two files, so I want to check whether a translocation occurs between the same pair of chromosomes (chr1, chr2) and whether the coordinates overlap.
Here is an example with two files:
file_1.txt:
chr1 min1 max1 chr2 min2 max2
1 111111 222222 2 333333 444444
2 777777 888888 3 555555 666666
15 10 100 15 2000 2100
17 500 530 18 700 750
20 123456 234567 20 345678 456789
file_2.txt:
chr1 min1 max1 chr2 min2 max2
1 100000 200000 2 400000 500000
2 800000 900000 3 500000 600000
15 200 300 15 2000 3000
20 150000 200000 20 300000 500000
The objective is that the pair (chr1, chr2) is the same between file 1 and file 2, and that the intervals [min1, max1] overlap between the two files; the same goes for [min2, max2].
For the result, perhaps the best solution is to print the two matching lines, as follows:
1 111111 222222 2 333333 444444
1 100000 200000 2 400000 500000
2 777777 888888 3 555555 666666
2 800000 900000 3 500000 600000
20 123456 234567 20 345678 456789
20 150000 200000 20 300000 500000
(For this simplified example, I tried to represent the different types of overlap I could encounter. I hope it is clear enough.)
Thank you for your help.
awk to the rescue!
$ awk 'function overlap(x1,y1,x2,y2) {return y1>x2 && y2>x1}  # do [x1,y1] and [x2,y2] intersect?
{k=$1 FS $4}                                                  # key on the chromosome pair
NR==FNR {r[k]=$0; c1min[k]=$2; c1max[k]=$3; c2min[k]=$5; c2max[k]=$6; next}  # cache file1
overlap(c1min[k],c1max[k],$2,$3) &&                           # both intervals must overlap
overlap(c2min[k],c2max[k],$5,$6) {print r[k] ORS $0 ORS}' file1 file2
1 111111 222222 2 333333 444444
1 100000 200000 2 400000 500000
2 777777 888888 3 555555 666666
2 800000 900000 3 500000 600000
20 123456 234567 20 345678 456789
20 150000 200000 20 300000 500000
assumes the first file fits in memory and holds at most one translocation per chromosome pair (a later line with the same pair overwrites the earlier one); it also prints an extra empty line at the end. Note that overlap() uses strict inequalities, so intervals that merely touch at an endpoint do not count.
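If file1 can contain more than one translocation for the same chromosome pair, the single r[k] entry above gets overwritten; a hedged variant (a sketch under that assumption, same overlap test) that keeps every file1 row per pair:

awk 'function overlap(x1,y1,x2,y2) {return y1>x2 && y2>x1}
     {k=$1 FS $4}
     NR==FNR {n[k]++; row[k,n[k]]=$0; next}    # keep every file1 line per pair
     {for (i=1; i<=n[k]; i++) {                # test the file2 line against each of them
        split(row[k,i], f)
        if (overlap(f[2],f[3],$2,$3) && overlap(f[5],f[6],$5,$6))
          print row[k,i] ORS $0 ORS
     }}' file1 file2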
I have 2 files. File1 contains gene expression: 300 samples (rows) and ~50k genes (~50k columns). File2 contains MRI data: the same 300 samples and ~100 columns. I want to correlate the GE data with the MRI data while controlling for a covariate, disease status (File3, status 0 or 1).
I have tried the ppcor package; for 2 variables it worked:
pcor.test(file1$gene1, file2$var1, file3$status, method="pearson")
but I want to run it for all variables, so I merged everything into one file, the last column being status:
sapply(1:(ncol(test)-1), function(x)
  sapply(1:(ncol(test)-1), function(y) {
    if (x == y) 1
    else pcor.test(test[, x], test[, y], test[, ncol(test)])$estimate
  }))
but got this error:
Error in solve.default(cvx) :
system is computationally singular: reciprocal condition number = 6.36939e-18
Am I doing this correctly? Is this a good method?
Thank you for your suggestions and help.
Georg
File1
gene1 gene2 gene3 ... gene50,000
Sample1 12 300 70 4000
Sample2 25 100 53 4500
Sample3 70 30 71 2000
...
Sample300 18 200 97 1765
File2
var1 var2 var3 ... var100
Sample1 5 1 170 200
Sample2 7 3 153 100
Sample3 7 18 130 34
...
Sample300 18 54 197 71
File3-STATUS
status
Sample1 1
Sample2 1
Sample3 0
...
Sample300 1
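For what it is worth, one way around the merged all-against-all matrix is to loop only over gene × MRI pairs, which is what the question actually asks for. A minimal R sketch, assuming the three tables are already loaded as data frames ge, mri, and status (hypothetical names) with samples in the same row order:

library(ppcor)

# partial correlation of every gene with every MRI variable,
# controlling for disease status
res <- matrix(NA_real_, nrow = ncol(ge), ncol = ncol(mri),
              dimnames = list(colnames(ge), colnames(mri)))
for (g in seq_len(ncol(ge))) {
  for (m in seq_len(ncol(mri))) {
    res[g, m] <- pcor.test(ge[, g], mri[, m], status$status,
                           method = "pearson")$estimate
  }
}

With ~50,000 genes and ~100 MRI variables this is about 5 million tests, so expect it to take a while.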
I'm trying to split multiple comma-separated values into rows.
I have achieved it for a small number of columns with comma-separated values (with awk), but in my real table I have to do it for 80 columns, so I'm looking for a way to iterate.
An example of input that I need to split:
CHROM POS REF ALT GT_00 C_00 D_00 E_00 F_00 GT_11
chr1 10 T A 2,2 1,1 0,1 1,2 1,0 2
chr1 10 T G 3 2 1 2 0 0
The expected output:
chr1 10 T A 2 1 0 1 1 2
chr1 10 T A 2 1 1 2 0 2
chr1 10 T G 3 2 1 2 0 0
I have done it with the following code:
awk 'BEGIN{FS=OFS="\t"}
{
j=split($5,a,",");split($6,b,",");
split($7,c,",");split($8,d,",");split($9,e,",");
for(i=1;i<=j;++i)
{
$5=a[i];$6=b[i];$7=c[i];$8=d[i];$9=e[i];print
}}'
But, as I said before, my real data has 80 (or more) columns with comma-separated values.
Is there a way to do it using iteration?
Note: I need to do it in bash (not MySQL, SQL, Python...).
This awk may do:
file:
chr1 10 T A 2,2 1,1 0,1 1,2 1,0 2
chr1 10 T G 3 2 1 2 0 0
chr1 10 T C 5 1,2,3 4,2,1 7 1,8,3 3
chr1 10 T D 1,2,3,5 4,2,1,8 1,8,3,2 3 5 7
Solution:
awk '{
  n=0
  # find the largest number of comma-separated values on this line
  for(i=5;i<=NF;i++) { t=split($i,a,","); if(t>n) n=t }
  # emit one output row per value
  for(j=1;j<=n;j++) {
    printf "%s\t%s\t%s\t%s",$1,$2,$3,$4
    for(i=5;i<=NF;i++) {
      t=split($i,a,",")
      printf "\t%s",(j<=t ? a[j] : a[1])  # repeat the first value when a field runs out
    }
    print ""
  }
}' file
chr1 10 T A 2 1 0 1 1 2
chr1 10 T A 2 1 1 2 0 2
chr1 10 T G 3 2 1 2 0 0
chr1 10 T C 5 1 4 7 1 3
chr1 10 T C 5 2 2 7 8 3
chr1 10 T C 5 3 1 7 3 3
chr1 10 T D 1 4 1 3 5 7
chr1 10 T D 2 2 8 3 5 7
chr1 10 T D 3 1 3 3 5 7
chr1 10 T D 5 8 2 3 5 7
Your test input gives:
chr1 10 T A 2 1 0 1 1 2
chr1 10 T A 2 1 1 2 0 2
chr1 10 T G 3 2 1 2 0 0
It does not matter whether the comma-separated fields are adjacent, but you should not mix different value counts (say, 2 and 3 values) on the same line: a field with fewer values just repeats its first value.
Here is another awk. In contrast to the previous solutions, which split fields into arrays, this one attacks the problem with substitutions; there is no iterating over fields at all:
awk '
BEGIN { OFS="\t" }
{ $1=$1; t=$0 }                                 # rebuild the record with OFS, keep a copy
{ while(index($0,",")) {
    gsub(/,[[:alnum:],]*/,""); print            # keep only the first value of every field
    $0=t; gsub(OFS "[[:alnum:]]*,",OFS); t=$0   # drop the first value of every multi-value field
  }
  print t
}' file
How does it work? The idea is based on two kinds of substitution:
gsub(/,[[:alnum:],]*/,""): this removes, in every field, the substring of alphanumeric characters and commas that starts at the first comma: 1,2,3,4 -> 1. Fields without a comma are unchanged.
gsub(OFS "[[:alnum:]]*,",OFS): this removes the alphanumeric characters followed by a single comma at the beginning of a field: 1,2,3,4 -> 2,3,4.
Using these two substitutions, we iterate until no comma is left. See How can you tell which characters are in which character classes? for details on [[:alnum:]].
input:
chr1 10 T A 2,2 1,1 0,1 1,2 1,0 2
chr1 10 T G 3 2 1 2 0 0
chr1 10 T C 5 1,2,3 4,2,1 7 1,8,3 3
chr1 10 T D 1,2,3,5 4,2,1,8 1,8,3,2 3 5 7
output:
chr1 10 T A 2 1 0 1 1 2
chr1 10 T A 2 1 1 2 0 2
chr1 10 T G 3 2 1 2 0 0
chr1 10 T C 5 1 4 7 1 3
chr1 10 T C 5 2 2 7 8 3
chr1 10 T C 5 3 1 7 3 3
chr1 10 T D 1 4 1 3 5 7
chr1 10 T D 2 2 8 3 5 7
chr1 10 T D 3 1 3 3 5 7
chr1 10 T D 5 8 2 3 5 7
I have two tab-delimited files as follows:
File1:
cg00000292 0.780482425 chr1 10468 10470
cg00002426 0.914482257 chr3 57757816 57757817
cg00003994 0.017355388 chr1 15686237 15686238
cg00005847 0.065539061 chr1 176164345 176164346
cg00006414 0.000000456 chr7 10630 10794
cg00007981 0.018839033 chr11 94129428 94129429
cg00008493 0.982994402 chr3 10524 10524
cg00008713 0.018604172 chr18 11980954 11980955
cg00009407 0.002403351 chr3 88824577 88824578
File2:
chr1 10468 10470 2 100 78 0.780
chr1 10483 10496 4 264 244 0.924
chr3 10524 10524 1 47 44 0.936
chr1 10541 10541 1 64 50 0.781
chr3 10562 10588 5 510 480 0.941
chr1 10608 10619 3 243 231 0.951
chr7 10630 10794 42 5292 5040 0.952
chr1 10810 10815 3 135 102 0.756
I want to merge these two files into a single file when both values in columns 3 and 4 of file1 equal columns 1 and 2 of file2, keeping all columns of file2 plus column 2 of file1.
The output would look like this:
chr1 10468 10470 2 100 78 0.780 0.780482425
chr3 10524 10524 1 47 44 0.936 0.982994402
chr7 10630 10794 42 5292 5040 0.952 0.000000456
Thank you so much,
Vahid.
I tried this awk command:
awk 'NR==FNR{a[$3,$4]=$1OFS$2;next}{$6=a[$1,$2];print}' file1.tsv file2.tsv
But it does not give me the output I am looking for; instead, the output is a combination of both files, like this:
chr1 10468 10470 2 100 cg00000292 0.780482425 0.78
chr1 10483 10496 4 264 0.924
chr3 10524 10524 1 47 cg00008493 0.982994402 0.936
chr1 10541 10541 1 64 0.781
chr3 10562 10588 5 510 0.941
chr1 10608 10619 3 243 0.951
chr7 10630 10794 42 5292 cg00006414 0.000000456 0.952
chr1 10810 10815 3 135 0.756
The basic idea here is to read the first file and, using each line's third and fourth columns as a key, save the second column in an array. Then, for each line of the second file, if its first two columns were seen in the first file, print that line followed by the saved second column from the first file.
$ awk 'BEGIN{ FS=OFS="\t" }
NR==FNR { seen[$3,$4]=$2; next }
($1,$2) in seen { print $0, seen[$1,$2] }' file1.tsv file2.tsv
chr1 10468 10470 2 100 78 0.780 0.780482425
chr3 10524 10524 1 47 44 0.936 0.982994402
chr7 10630 10794 42 5292 5040 0.952 0.000000456
# I want to merge these two files in a unique file
# if both values in columns 3 and 4 of file1
# are equal to columns 1 and 2 of file2
# and to keep all columns of file2 plus column 2 of file1.
join -t$'\t' -11 -21 -o2.2,2.3,2.4,2.5,2.6,2.7,2.8,1.3 <(
<file1 awk -vFS=$'\t' -vOFS=$'\t' '{ print $3 $4,$0 }' |
sort -t$'\t' -k1,1
) <(
<file2 awk -vFS=$'\t' -vOFS=$'\t' '{ print $1 $2,$0 }' |
sort -t$'\t' -k1,1
)
First preprocess each file, prepending a key built from the columns you want to join on.
Sort both streams on that key and join them.
Specify the output format with join's -o option.
Tested on repl against:
# recreate input files
tr -s ' ' <<EOF | tr ' ' '\t' >file1
cg00000292 0.780482425 chr1 10468 10470
cg00002426 0.914482257 chr3 57757816 57757817
cg00003994 0.017355388 chr1 15686237 15686238
cg00005847 0.065539061 chr1 176164345 176164346
cg00006414 0.000000456 chr7 10630 10794
cg00007981 0.018839033 chr11 94129428 94129429
cg00008493 0.982994402 chr3 10524 10524
cg00008713 0.018604172 chr18 11980954 11980955
cg00009407 0.002403351 chr3 88824577 88824578
EOF
tr -s ' ' <<EOF | tr ' ' '\t' >file2
chr1 10468 10470 2 100 78 0.780
chr1 10483 10496 4 264 244 0.924
chr3 10524 10524 1 47 44 0.936
chr1 10541 10541 1 64 50 0.781
chr3 10562 10588 5 510 480 0.941
chr1 10608 10619 3 243 231 0.951
chr7 10630 10794 42 5292 5040 0.952
chr1 10810 10815 3 135 102 0.756
EOF
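With the files recreated above, the pipeline should print the three matching rows requested in the question:

chr1 10468 10470 2 100 78 0.780 0.780482425
chr3 10524 10524 1 47 44 0.936 0.982994402
chr7 10630 10794 42 5292 5040 0.952 0.000000456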