Bash/Awk: Find common translocations in two files using overlapping coordinates

I would like to compare two files to identify common translocations. However, these translocations don't have exactly the same coordinates in both files, so I want to check whether a translocation occurs between the same pair of chromosomes (chr1, chr2) and whether the coordinates overlap.
Here is an example with two files:
file_1.txt:
chr1 min1 max1 chr2 min2 max2
1 111111 222222 2 333333 444444
2 777777 888888 3 555555 666666
15 10 100 15 2000 2100
17 500 530 18 700 750
20 123456 234567 20 345678 456789
file_2.txt:
chr1 min1 max1 chr2 min2 max2
1 100000 200000 2 400000 500000
2 800000 900000 3 500000 600000
15 200 300 15 2000 3000
20 150000 200000 20 300000 500000
The objective is that the pair (chr1, chr2) is the same in file 1 and file 2. Then the coordinates min1 and max1 must overlap between the two files, and the same for min2 and max2.
For the result, perhaps the best output is to print the two matching lines, as follows:
1 111111 222222 2 333333 444444
1 100000 200000 2 400000 500000
2 777777 888888 3 555555 666666
2 800000 900000 3 500000 600000
20 123456 234567 20 345678 456789
20 150000 200000 20 300000 500000
(For this simplified example, I tried to represent the different types of overlap I could encounter. I hope it is clear enough).
Thank you for your help.

awk to the rescue!
$ awk 'function overlap(x1,y1,x2,y2) {return y1>x2 && y2>x1}
{k=$1 FS $4}
NR==FNR {r[k]=$0; c1min[k]=$2; c1max[k]=$3; c2min[k]=$5; c2max[k]=$6; next}
overlap(c1min[k],c1max[k],$2,$3) &&
overlap(c2min[k],c2max[k],$5,$6) {print r[k] ORS $0 ORS}' file1 file2
1 111111 222222 2 333333 444444
1 100000 200000 2 400000 500000
2 777777 888888 3 555555 666666
2 800000 900000 3 500000 600000
20 123456 234567 20 345678 456789
20 150000 200000 20 300000 500000
This assumes the first file can be held in memory; it also prints an extra empty line at the end. Note that the overlap test uses strict inequalities, so intervals that only touch at an endpoint are not counted as overlapping.
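One limitation: r[k] keeps only the last file_1 line for a given chromosome pair, so only one translocation per pair is compared. If file_1 may contain several translocations for the same pair, a variant along these lines (a sketch, not tested on real data) stores and checks them all:
$ awk 'function overlap(x1,y1,x2,y2) {return y1>x2 && y2>x1}
       {k=$1 FS $4}
       NR==FNR {r[k, ++n[k]]=$0; next}      # store every file_1 line under its chromosome pair
       {for (i=1; i<=n[k]; i++) {           # compare against each stored line for this pair
          split(r[k, i], f)
          if (overlap(f[2],f[3],$2,$3) && overlap(f[5],f[6],$5,$6))
            print r[k, i] ORS $0 ORS
        }}' file1 file2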


How can I insert lines to data using Linux commands based on conditions?

I have a data file like this:
5000 1
6000 1
7000 2
9000 5
10000 5
11000 6
12000 8
14000 9
15000 10
...
Data is printed at every multiple of 1000 in the first column, but some lines are missing. In the example above, the lines for 1000 to 4000, 8000, and 13000 are missing.
I wish to insert new lines for the missing values, with 0 in the second column, so the result should look like:
1000 0
2000 0
3000 0
4000 0
5000 1
6000 1
7000 2
8000 0
9000 5
10000 5
11000 6
12000 8
13000 0
14000 9
15000 10
...
All the way to the end of the file.
Can I do this using Linux commands like awk and/or cat, or do I need to write a shell script with an if/loop?
$ awk -v step='1000' '
{
    # fill the gap between the previous value and the current one with zero lines
    for (i=prev+step; i<$1; i+=step) {
        print prev+=step, 0
    }
    print      # print the existing line unchanged
    prev=$1    # remember the last value seen
}
' file
1000 0
2000 0
3000 0
4000 0
5000 1
6000 1
7000 2
8000 0
9000 5
10000 5
11000 6
12000 8
13000 0
14000 9
15000 10
Could you please try the following, written and tested in GNU awk with the samples shown.
awk '
{
    max=(max>$1?max:$1)   # track the largest value seen in column 1
    arr[$1]=$2            # remember column 2, keyed by column 1
}
END{
    for(i=1000;i<=max;i+=1000){
        print i, arr[i]+0   # missing entries default to 0
    }
}
' Input_file
Or create an awk variable named diffStart and set its value there (1000 in this case); change it there whenever you want a different step.
awk -v diffStart="1000" '
{
max=(max>$1?max:$1)
arr[$1]=$2
}
END{
for(i=diffStart;i<=max;i+=diffStart){
print i, arr[i]+0
}
}
' Input_file
Output will be as follows.
1000 0
2000 0
3000 0
4000 0
5000 1
6000 1
7000 2
8000 0
9000 5
10000 5
11000 6
12000 8
13000 0
14000 9
15000 10
$ awk '{while((p+=1000)<$1) print p,0}1' file
1000 0
2000 0
3000 0
4000 0
5000 1
6000 1
7000 2
8000 0
9000 5
10000 5
11000 6
12000 8
13000 0
14000 9
15000 10
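As a small variant (a sketch, not from the original answer), the same one-liner can take the step as a variable, like the first solution:
$ awk -v step=1000 '{while((p+=step)<$1) print p,0} 1' file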

Correlation with covariate in R

I have 2 files. File1 contains gene expression: 300 samples (rows) and ~50k genes (~50k columns). File2 contains MRI data: the same 300 samples and ~100 columns. I want to correlate the GE data with the MRI data while controlling for a covariate, disease status (file3, status 0 or 1).
I have tried to use the ppcor package; for 2 variables it worked:
pcor.test(file1$gene1, file2$var1, file3$status, method="pearson")
but I want to run it for all variables, so I merged everything into one data frame, with the last column being status:
sapply(1:(ncol(test)-1), function(x) sapply(1:(ncol(test)-1), function(y) {
if (x == y) 1
else pcor.test(test[,x], test[,y], test[,ncol(test)])$estimate
}))
but had this error:
Error in solve.default(cvx) :
system is computationally singular: reciprocal condition number = 6.36939e-18
Am I doing this correctly? Is this a good method?
Thank you for suggestions and help
Georg
File1
gene1 gene2 gene3 ... gene50,000
Sample1 12 300 70 4000
Sample2 25 100 53 4500
Sample3 70 30 71 2000
...
Sample300 18 200 97 1765
File2
var1 var2 var3 ... var100
Sample1 5 1 170 200
Sample2 7 3 153 100
Sample3 7 18 130 34
...
Sample300 18 54 197 71
File3-STATUS
status
Sample1 1
Sample2 1
Sample3 0
...
Sample300 1

subtracting data from columns in bash csv

I have several columns in a file. I want to subtract two columns...
They have this form (the dots are thousands separators, not decimal points):
1.000 900
1.012 1.010
1.015 1.005
1.020 1.010
I need another column in the same file with the result of the subtraction:
100
2
10
10
I have tried
awk -F "," '{$16=$4-$2; print $1","$2","$3","$4","$5","$6}'
but it gives me...
0.100
0.002
0.010
0.010
Any indication?
Using this awk:
awk -v OFS='\t' '{p=$1;q=$2;sub(/\./, "", p); sub(/\./, "", q); print $0, (p-q)}' file
1.000 900 100
1.012 1.010 2
1.015 1.005 10
1.020 1.010 10
Using perl:
perl -lanE '$,="\t"; ($x,$y)=map{s/\.//r}@F; say @F,$x-$y' file
prints:
1.000 900 100
1.012 1.010 2
1.015 1.005 10
1.020 1.010 10
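If any value can exceed 999.999 (i.e. contains more than one thousands separator), a variant of the awk answer using gsub instead of sub (just a sketch under that assumption) strips every dot before subtracting:
awk -v OFS='\t' '{p=$1; q=$2; gsub(/\./, "", p); gsub(/\./, "", q); print $0, p-q}' file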

How to use awk and sed to make statistics on a data

I have a raw data file and I want to generate an output file; both are shown below. The rule is that column 1 of the output is column 2 of the raw file, and column 2 of the output is the mean of the last-column values of all raw-data lines whose column 2 matches. For example, value1 in the output is (25+24.3846+13.8972+1.33333+1)/5.
#Raw data
4 compiler-compiler 100000 99975 1 25
4 compiler-compiler 100000 99683 13 24.3846
4 compiler-compiler 100000 93649 457 13.8972
4 compiler-compiler 100000 99764 177 1.33333
4 compiler-compiler 100000 99999 1 1
4 compiler-sunflow 100000 99999 1 1
4 compiler-sunflow 100000 99674 11 29.6364
4 compiler-sunflow 100000 93467 423 15.4444
4 compiler-sunflow 100000 99694 159 1.92453
4 compiler-sunflow 100000 99938 4 15.5
4 compress 100000 99997 1 3
4 compress 100000 99653 10 34.7
4 compress 100000 93639 454 14.011
4 compress 100000 99666 173 1.93064
4 compress 100000 99978 4 5.5
4 serial 100000 99998 1 2
4 serial 100000 99932 6 11.3333
4 serial 100000 93068 460 15.0696
4 serial 100000 99264 206 3.57282
4 serial 100000 99997 3 1
4 sunflow 100000 99998 1 2
4 sunflow 100000 99546 18 25.2222
4 sunflow 100000 93387 481 13.7484
4 sunflow 100000 99752 189 1.31217
4 sunflow 100000 99974 4 6.5
4 xml-transfomer 100000 99994 1 6
4 xml-transfomer 100000 99964 3 12
4 xml-transfomer 100000 93621 463 13.7775
4 xml-transfomer 100000 99540 199 2.31156
4 xml-transfomer 100000 99986 2 7
4 xml-validation 100000 99996 1 4
4 xml-validation 100000 99563 16 27.3125
4 xml-validation 100000 93748 451 13.8625
4 xml-validation 100000 99716 190 1.49474
4 xml-validation 100000 99979 3 7
#Output data
compiler-compiler value1
....
xml-transfomer value2
xml-validation value3
I think the commands awk and sed can do this, but I do not know how to get it.
sed cannot be used here since it does not support math operations. It's a job for awk:
awk 'NR>1{c[$2]++;s[$2]+=$(NF)}END{for(i in c){print i,s[i]/c[i]}}' input.txt
Explanation:
NR>1 { c[$2]++; s[$2]+=$NF }
NR>1 means that the following block gets executed on all lines except the first one. $2 is the value of the second column. NF is the number of fields per line, so $NF contains the value of the last column. c and s are associative arrays: c counts the occurrences of $2, and s stores the total of the numeric values in the last column, grouped by $2.
END {for(i in c){print i,s[i]/c[i]}}
END means the following action will take place after the last line of input has been processed. The for loop iterates through c and outputs the name and the mean for all indexes in c.
Output:
xml-validation 10.7339
compiler-compiler 13.123
serial 6.59514
sunflow 9.75655
xml-transfomer 8.21781
compiler-sunflow 12.7011
compress 11.8283
Note that you have no influence on the output order when using an associative array this way, since for (i in c) iterates in an unspecified order. If you care about the output order you might use the following command:
awk 'NR>1 && $2!=n && c {print n, t/c; c=t=0} NR>1 {n=$2; c++; t+=$NF} END {if (c) print n, t/c}' input.txt
This command does not use associative arrays; it prints the stats as soon as $2 changes, and the END block prints the last group. Note that this requires the input to be sorted (or at least grouped) by $2.
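If the input is not already grouped by benchmark name, one option (a sketch, assuming the '#Raw data' comment is the first line) is to drop that line and sort by column 2 first, which also lets the NR>1 guard be dropped:
tail -n +2 input.txt | sort -k2,2 |
awk '$2!=n && c {print n, t/c; c=t=0} {n=$2; c++; t+=$NF} END {if (c) print n, t/c}'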

Unix sort using unknown delimiter (last column)

My data looks like this:
Adelaide Crows 5 2 3 0 450 455 460.67 8
Essendon 5 5 0 0 622 352 955.88 20
Fremantle 5 3 2 0 439 428 598.50 12
As you can tell, there is a mixture of spaces and tabs. I need to sort by the last column in descending order, so the output looks like this:
Essendon 5 5 0 0 622 352 955.88 20
Fremantle 5 3 2 0 439 428 598.50 12
Adelaide Crows 5 2 3 0 450 455 460.67 8
The whole data consists of all AFL teams.
How can I achieve this using sort? Am I right in attempting to use the $ character to start from the end of the line? I also need to sort by the second-last column after the last column, so that any ties in the last column are broken by the second-last column. Code so far:
sort -n -t$'\t' -k 9,9 -k 8,8 tmp
How do I take into account that the football team names contain whitespace?
Here is the file being sorted (filename: 'tmp'); a sample of the data is shown above.
You could copy the last field into the first position using awk, then sort by the first field, and finally get rid of the first field using cut.
awk '{print($NF" "$0)}' sample.txt | sort -k1,1 -n -r -t' ' | cut -f2- -d' '
Port Adelaide 5 5 0 0 573 386 916.05 20
Essendon 5 5 0 0 622 352 955.88 20
Sydney Swans 5 4 1 0 533 428 681.68 16
Hawthorn 5 4 1 0 596 453 620.64 16
Richmond 5 3 2 0 499 445 579.68 12
..
..
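To also break ties on the second-last column, as the question asks, the same trick can be extended (a sketch under the same whitespace assumptions): copy the last two fields to the front, sort on both, then cut them off again.
awk '{print $NF" "$(NF-1)" "$0}' sample.txt | sort -t' ' -k1,1nr -k2,2nr | cut -f3- -d' '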
