I am preparing the background GMT file for gProfiler (a custom annotation for a non-model plant), but my file has the gene ID in the first column followed by its GO terms, and the mapping needs to be inverted: each GO term followed by the genes annotated with it. Is there a way in Linux or R to rearrange these data?
**current**
Gene1 GO:A GO:B GO:C GO:D GO:E
Gene2 GO:E GO:F GO:G GO:H GO:I GO:J
Gene3 GO:A GO:D GO:E GO:H GO:J
Gene4 GO:B GO:C GO:E GO:F GO:J
Gene5 GO:C GO:J
**new**
GO:A Gene1 Gene3
GO:B Gene1 Gene4
GO:C Gene1 Gene4 Gene5
GO:D Gene1 Gene3
GO:E Gene1 Gene2 Gene3 Gene4
GO:F Gene2 Gene4
GO:G Gene2
GO:H Gene2 Gene3
GO:I Gene2
GO:J Gene2 Gene3 Gene4 Gene5
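One way to do this in Linux is a minimal awk sketch (the input file name input.gmt is assumed, as is whitespace-separated input):
awk '{ for (i = 2; i <= NF; i++)              # columns 2..NF are the GO terms
           members[$i] = members[$i] " " $1   # append this gene to the term
     }
     END { for (term in members) print term members[term] }' input.gmt | sort
The trailing sort only fixes the order of the output lines; within each line, genes appear in file order. If your GMT consumer expects tab separators and a description column, adjust the separator and add one.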
I have a large set of 10 gene lists (each one around a thousand genes) that look like this:
Gene_list 1 Gene_list 2 Gene_list 3
Gene1 Gene5 Gene4
Gene2 Gene567893 Gene22
Gene4578963 Gene2 Gene5
and I would like to get something like this:
Gene_list Gene_Name
Gene_list 1 Gene1
Gene_list 1 Gene2
Gene_list 1 Gene4578963
Gene_list 2 Gene5
Gene_list 2 Gene567893
Gene_list 2 Gene2
Gene_list 3 Gene22
Gene_list 3 Gene5
I tried gather, spread, and everything I could find online, but the only option I found was to write out something like c(gene1, gene2, ...), which is obviously not an option here with so many different genes in each list. A sketch of one approach follows.
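Since only listing the genes by hand is ruled out, a minimal awk sketch may help (assuming tab-separated columns in a hypothetical file lists.tsv; the columns may have unequal lengths):
awk -F '\t' -v OFS='\t' '
    NR == 1 { for (i = 1; i <= NF; i++) name[i] = $i; ncol = NF; next }  # remember the list names
    { for (i = 1; i <= NF && i <= ncol; i++)
          if ($i != "") out[i] = out[i] ORS name[i] OFS $i }             # collect one line per gene, grouped by column
    END { printf "Gene_list%sGene_Name", OFS                             # header
          for (i = 1; i <= ncol; i++) printf "%s", out[i]
          print "" }' lists.tsv
This buffers everything in memory, which is fine for ten lists of roughly a thousand genes each.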
I would like to compare two files to identify common translocations. However, these translocations don't have exactly the same coordinates between the files. So I want to see if the translocation occurs between the same pair of chromosomes (chr1, chr2) and if the coordinates overlap.
Here is an example with two files:
file_1.txt:
chr1 min1 max1 chr2 min2 max2
1 111111 222222 2 333333 444444
2 777777 888888 3 555555 666666
15 10 100 15 2000 2100
17 500 530 18 700 750
20 123456 234567 20 345678 456789
file_2.txt:
chr1 min1 max1 chr2 min2 max2
1 100000 200000 2 400000 500000
2 800000 900000 3 500000 600000
15 200 300 15 2000 3000
20 150000 200000 20 300000 500000
The objective is to match rows where the chromosome pair (chr1, chr2) is the same between file 1 and file 2, the coordinates min1 and max1 overlap between the two files, and likewise for min2 and max2.
For the result, perhaps the best solution is to print the two matching lines, as follows:
1 111111 222222 2 333333 444444
1 100000 200000 2 400000 500000
2 777777 888888 3 555555 666666
2 800000 900000 3 500000 600000
20 123456 234567 20 345678 456789
20 150000 200000 20 300000 500000
(For this simplified example, I tried to represent the different types of overlap I could encounter. I hope it is clear enough).
Thank you for your help.
awk to the rescue!
$ awk 'function overlap(x1,y1,x2,y2) {return y1>x2 && y2>x1}   # do intervals [x1,y1] and [x2,y2] intersect?
{k=$1 FS $4}                                                   # key: the chromosome pair
NR==FNR {r[k]=$0; c1min[k]=$2; c1max[k]=$3; c2min[k]=$5; c2max[k]=$6; next}   # file 1: save the line and its coordinates
overlap(c1min[k],c1max[k],$2,$3) &&
overlap(c2min[k],c2max[k],$5,$6) {print r[k] ORS $0 ORS}' file1 file2         # file 2: print both lines on a double match
1 111111 222222 2 333333 444444
1 100000 200000 2 400000 500000
2 777777 888888 3 555555 666666
2 800000 900000 3 500000 600000
20 123456 234567 20 345678 456789
20 150000 200000 20 300000 500000
assumes the first file can be held in memory and that each chromosome pair occurs on at most one line of it (a later line with the same pair overwrites the earlier one); it also prints an extra empty line at the end.
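If chromosome pairs can repeat in the first file, here is a sketch of a variant (not part of the original answer) that keeps every line per pair:
awk 'function overlap(x1,y1,x2,y2) {return y1>x2 && y2>x1}
{k=$1 FS $4}
NR==FNR {n[k]++; r[k,n[k]]=$0; next}          # file 1: store all lines per chromosome pair
{for (i=1; i<=n[k]; i++) {                    # file 2: test against each stored line
     split(r[k,i], f)                         # f[2],f[3],f[5],f[6] are the stored coordinates
     if (overlap(f[2]+0,f[3]+0,$2,$3) && overlap(f[5]+0,f[6]+0,$5,$6))
         print r[k,i] ORS $0 ORS }}' file1 file2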
I am running a for-loop over several files located in a directory. Each command within the for-loop appends a new column to the former .txt file. Currently, the 3rd line in the for-loop creates a column with the file path, but I want just the filename, without the file extension. I've played around with splitting and piping back into awk, but no luck.
After adjusting the awk command to get just the filename, I want to make a master .txt file that contains the values from all iterations. Essentially, I think I need to append the output of each iteration to one .txt file. Right now that's what I'm trying to do with the pipe in the third line of the for-loop, but it just creates an empty .txt file.
for file in ~/Desktop/test/*bam
do
bedtools multicov -bams "$file" -bed bed_for_multicov.bed > "${file%%_RRemoved.bam}_counts.txt"
awk '{print $0,a}' a="$(samtools view -c "$file")" ${file%%_RRemoved.bam}_counts.txt > ${file%%_RRemoved.bam}_CPMcounts.txt
awk -v var="$file" '{print $0, var}' ${file%%_RRemoved.bam}_CPMcounts.txt > ${file%%_RRemoved.bam}_CPMcounts2.txt | >> CPMcountsMaster.txt
done
Current filename1_CPMcounts2.txt output
chr1 11088 11488 peak_1 192 4409922 path/to/filename1.bam
chr1 20674 21215 peak_2 217 4409922 path/to/filename1.bam
chr1 28550 28862 peak_3 170 4409922 path/to/filename1.bam
chr1 29582 30300 peak_4 437 4409922 path/to/filename1.bam
chr1 30635 31720 peak_5 696 4409922 path/to/filename1.bam
chr1 32373 35541 peak_6 2877 4409922 path/to/filename1.bam
Current filename2_CPMcounts2.txt output
chr1 11088 11488 peak_1 293 5888360 path/to/filename2.bam
chr1 20674 21215 peak_2 439 5888360 path/to/filename2.bam
chr1 28550 28862 peak_3 392 5888360 path/to/filename2.bam
chr1 29582 30300 peak_4 901 5888360 path/to/filename2.bam
Desired filename1_CPMcounts2.txt output
chr1 11088 11488 peak_1 192 4409922 filename1
chr1 20674 21215 peak_2 217 4409922 filename1
chr1 28550 28862 peak_3 170 4409922 filename1
chr1 29582 30300 peak_4 437 4409922 filename1
chr1 30635 31720 peak_5 696 4409922 filename1
chr1 32373 35541 peak_6 2877 4409922 filename1
Desired Final CPMcountsMaster.txt
chr1 11088 11488 peak_1 192 4409922 filename1
chr1 20674 21215 peak_2 217 4409922 filename1
chr1 28550 28862 peak_3 170 4409922 filename1
chr1 29582 30300 peak_4 437 4409922 filename1
chr1 30635 31720 peak_5 696 4409922 filename1
chr1 32373 35541 peak_6 2877 4409922 filename1
chr1 11088 11488 peak_1 293 5888360 filename2
chr1 20674 21215 peak_2 439 5888360 filename2
chr1 28550 28862 peak_3 392 5888360 filename2
chr1 29582 30300 peak_4 901 5888360 filename2
The following works, adapted from J Leffler's comment - thanks! (The samtools view -c step from the original loop is still needed, so that the _CPMcounts.txt input exists.)
for file in ~/Desktop/test/*bam
do
bedtools multicov -bams "$file" -bed bed_for_multicov.bed > "${file%%_RRemoved.bam}_counts.txt"
awk '{print $0,a}' a="$(samtools view -c "$file")" "${file%%_RRemoved.bam}_counts.txt" > "${file%%_RRemoved.bam}_CPMcounts.txt"
awk '{print $0,a}' a="$(basename "$file" _RRemoved.bam)" "${file%%_RRemoved.bam}_CPMcounts.txt" > "${file%%_RRemoved.bam}_CPMcounts2.txt"
awk '{print $0,a}' a="$(basename "$file" _RRemoved.bam)" "${file%%_RRemoved.bam}_CPMcounts.txt" >> CPMcountsMaster.txt
done
I have two tab-delimited files as follows:
File1:
cg00000292 0.780482425 chr1 10468 10470
cg00002426 0.914482257 chr3 57757816 57757817
cg00003994 0.017355388 chr1 15686237 15686238
cg00005847 0.065539061 chr1 176164345 176164346
cg00006414 0.000000456 chr7 10630 10794
cg00007981 0.018839033 chr11 94129428 94129429
cg00008493 0.982994402 chr3 10524 10524
cg00008713 0.018604172 chr18 11980954 11980955
cg00009407 0.002403351 chr3 88824577 88824578
File2:
chr1 10468 10470 2 100 78 0.780
chr1 10483 10496 4 264 244 0.924
chr3 10524 10524 1 47 44 0.936
chr1 10541 10541 1 64 50 0.781
chr3 10562 10588 5 510 480 0.941
chr1 10608 10619 3 243 231 0.951
chr7 10630 10794 42 5292 5040 0.952
chr1 10810 10815 3 135 102 0.756
I want to merge these two files into a single file: where the values in columns 3 and 4 of file1 are equal to columns 1 and 2 of file2, keep all columns of file2 plus column 2 of file1.
The output should look like this:
chr1 10468 10470 2 100 78 0.780 0.780482425
chr3 10524 10524 1 47 44 0.936 0.982994402
chr7 10630 10794 42 5292 5040 0.952 0.000000456
Thank you so much,
Vahid.
I tried this awk command:
awk 'NR==FNR{a[$3,$4]=$1OFS$2;next}{$6=a[$1,$2];print}' file1.tsv file2.tsv
But it does not give me the output I am looking for; instead, the output is a combination of both files, like this:
chr1 10468 10470 2 100 cg00000292 0.780482425 0.78
chr1 10483 10496 4 264 0.924
chr3 10524 10524 1 47 cg00008493 0.982994402 0.936
chr1 10541 10541 1 64 0.781
chr3 10562 10588 5 510 0.941
chr1 10608 10619 3 243 0.951
chr7 10630 10794 42 5292 cg00006414 0.000000456 0.952
chr1 10810 10815 3 135 0.756
The basic idea here is to read the first file and, using each line's third and fourth columns as a key, save the second column in an array. Then, for each line in the second file, if its first two columns were seen in the first file, print that line and the saved second column from the first file.
$ awk 'BEGIN{ FS=OFS="\t" }
NR==FNR { seen[$3,$4]=$2; next }
($1,$2) in seen { print $0, seen[$1,$2] }' file1.tsv file2.tsv
chr1 10468 10470 2 100 78 0.780 0.780482425
chr3 10524 10524 1 47 44 0.936 0.982994402
chr7 10630 10794 42 5292 5040 0.952 0.000000456
# I want to merge these two files into a single file:
# where the values in columns 3 and 4 of file1
# are equal to columns 1 and 2 of file2,
# keep all columns of file2 plus column 2 of file1.
join -t$'\t' -1 1 -2 1 -o 2.2,2.3,2.4,2.5,2.6,2.7,2.8,1.3 <(
  # build the join key from chr and position, separated by "_" so that,
  # e.g., chr1/110468 cannot collide with chr11/10468
  <file1 awk -v FS=$'\t' -v OFS=$'\t' '{ print $3 "_" $4, $0 }' |
  sort -t$'\t' -k1,1
) <(
  <file2 awk -v FS=$'\t' -v OFS=$'\t' '{ print $1 "_" $2, $0 }' |
  sort -t$'\t' -k1,1
)
First, preprocess the files and prepend the fields you want to join on as a key.
Then sort both streams on that key and join them.
Finally, specify the output format for join.
Tested on repl against the following input:
# recreate input files
tr -s ' ' <<EOF | tr ' ' '\t' >file1
cg00000292 0.780482425 chr1 10468 10470
cg00002426 0.914482257 chr3 57757816 57757817
cg00003994 0.017355388 chr1 15686237 15686238
cg00005847 0.065539061 chr1 176164345 176164346
cg00006414 0.000000456 chr7 10630 10794
cg00007981 0.018839033 chr11 94129428 94129429
cg00008493 0.982994402 chr3 10524 10524
cg00008713 0.018604172 chr18 11980954 11980955
cg00009407 0.002403351 chr3 88824577 88824578
EOF
tr -s ' ' <<EOF | tr ' ' '\t' >file2
chr1 10468 10470 2 100 78 0.780
chr1 10483 10496 4 264 244 0.924
chr3 10524 10524 1 47 44 0.936
chr1 10541 10541 1 64 50 0.781
chr3 10562 10588 5 510 480 0.941
chr1 10608 10619 3 243 231 0.951
chr7 10630 10794 42 5292 5040 0.952
chr1 10810 10815 3 135 102 0.756
EOF
I have two files, file 1:
1 800 800 0.51
2 801 801 0.01
3 802 802 0.01
4 803 803 0.23
and file 2:
1 800 800 0.55
2 801 801 0.09
3 802 802 0.88
4 804 804 0.24
I have an awk script that looks in the second file for rows whose first three columns match a row of the first file and prints the fourth column:
$ awk 'NR==FNR{a[$1,$2,$3];next} {if (($1,$2,$3) in a) {print $4} else {print "not found"}}' f1 f2
0.55
0.09
0.88
not found
Is there a way to make it so that any rows occurring in file 2 but not in file 1 are still added at the end of the output, after the matches, like this:
0.55
0.09
0.88
not found
4 804 804 0.24
That way, when I paste the two files back together, they will look something like this:
1 800 800 0.51 0.55
2 801 801 0.01 0.09
3 802 802 0.01 0.88
4 803 803 0.23 not found
4 804 804 not found 0.24
Or is there any other more elegant solution with completely different syntax?
awk '{k=$1 FS $2 FS $3}              # key: the first three columns
NR==FNR {a[k]=$4; next}              # file 1: remember column 4 for each key
k in a {print $4; next}              # match in file 2: print its value
{print "not found"; print}' f1 f2    # no match: flag it, then print the whole row
The above will give you the following (unmatched file 2 rows are printed where they occur, which for this input matches the requested output):
0.55
0.09
0.88
not found
4 804 804 0.24