I am running a for-loop over several files in a directory. Each command in the loop takes the .txt file from the previous step and adds a new column to it. Currently, the third line in the loop creates a column with the full file path, but I want just the filename, without the file extension. I've played around with splitting and piping back into awk, but no luck.
After adjusting the awk command to get just the filename, I want to then make a master .txt file that contains all looped values. Essentially I think I would need to append a .txt file with the output from each loop. Right now that's what I'm trying to do with the pipe in the third line of the for loop, but it just creates an empty .txt file.
for file in ~/Desktop/test/*bam
do
bedtools multicov -bams "$file" -bed bed_for_multicov.bed > "${file%%_RRemoved.bam}_counts.txt"
awk '{print $0,a}' a="$(samtools view -c "$file")" ${file%%_RRemoved.bam}_counts.txt > ${file%%_RRemoved.bam}_CPMcounts.txt
awk -v var="$file" '{print $0, var}' ${file%%_RRemoved.bam}_CPMcounts.txt > ${file%%_RRemoved.bam}_CPMcounts2.txt | >> CPMcountsMaster.txt
done
Current filename1_CPMcounts2.txt output
chr1 11088 11488 peak_1 192 4409922 path/to/filename1.bam
chr1 20674 21215 peak_2 217 4409922 path/to/filename1.bam
chr1 28550 28862 peak_3 170 4409922 path/to/filename1.bam
chr1 29582 30300 peak_4 437 4409922 path/to/filename1.bam
chr1 30635 31720 peak_5 696 4409922 path/to/filename1.bam
chr1 32373 35541 peak_6 2877 4409922 path/to/filename1.bam
Current filename2_CPMcounts2.txt output
chr1 11088 11488 peak_1 293 5888360 path/to/filename2.bam
chr1 20674 21215 peak_2 439 5888360 path/to/filename2.bam
chr1 28550 28862 peak_3 392 5888360 path/to/filename2.bam
chr1 29582 30300 peak_4 901 5888360 path/to/filename2.bam
Desired filename1_CPMCounts2.txt output
chr1 11088 11488 peak_1 192 4409922 filename1
chr1 20674 21215 peak_2 217 4409922 filename1
chr1 28550 28862 peak_3 170 4409922 filename1
chr1 29582 30300 peak_4 437 4409922 filename1
chr1 30635 31720 peak_5 696 4409922 filename1
chr1 32373 35541 peak_6 2877 4409922 filename1
Desired Final CPMcountsMaster.txt
chr1 11088 11488 peak_1 192 4409922 filename1
chr1 20674 21215 peak_2 217 4409922 filename1
chr1 28550 28862 peak_3 170 4409922 filename1
chr1 29582 30300 peak_4 437 4409922 filename1
chr1 30635 31720 peak_5 696 4409922 filename1
chr1 32373 35541 peak_6 2877 4409922 filename1
chr1 11088 11488 peak_1 293 5888360 filename2
chr1 20674 21215 peak_2 439 5888360 filename2
chr1 28550 28862 peak_3 392 5888360 filename2
chr1 29582 30300 peak_4 901 5888360 filename2
The following works, adapted from J Leffler's comment - thanks!
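for file in ~/Desktop/test/*bam
do
# per-interval read counts
bedtools multicov -bams "$file" -bed bed_for_multicov.bed > "${file%%_RRemoved.bam}_counts.txt"
# add the total read count as a new column (this step creates _CPMcounts.txt)
awk '{print $0,a}' a="$(samtools view -c "$file")" "${file%%_RRemoved.bam}_counts.txt" > "${file%%_RRemoved.bam}_CPMcounts.txt"
# add the bare sample name: basename drops the directory and the _RRemoved.bam suffix
awk '{print $0,a}' a="$(basename "$file" _RRemoved.bam)" "${file%%_RRemoved.bam}_CPMcounts.txt" > "${file%%_RRemoved.bam}_CPMcounts2.txt"
# same rows again, appended to the master file
awk '{print $0,a}' a="$(basename "$file" _RRemoved.bam)" "${file%%_RRemoved.bam}_CPMcounts.txt" >> CPMcountsMaster.txt
done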
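For reference, here is a small sketch of the two usual ways to get the bare name out of a path. The path itself is made up for illustration; only the _RRemoved.bam suffix comes from the question.

```shell
# Hypothetical path, for illustration only
file="$HOME/Desktop/test/filename1_RRemoved.bam"

# basename drops the directory and, given a second argument, that suffix too
name_basename="$(basename "$file" _RRemoved.bam)"

# Pure parameter expansion: strip everything up to the last /, then the suffix
name_expansion="${file##*/}"
name_expansion="${name_expansion%_RRemoved.bam}"

printf '%s\n%s\n' "$name_basename" "$name_expansion"
```

Both print filename1. The parameter-expansion form avoids starting an extra process per file, which only matters for very large loops.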
The file I want to edit looks like this:
chr1 24809154 24809669
chr1 24546969 24547563
chr1 7932037 7932594
chr3 42012155 42012598
chr3 923035 923549
chr4 5799575 5799990
chr4 6895845 6896348
chr4 2337251 2337743
chr5 10715994 10716426
chr5 4385445 4385878
And I have another reference table file that has alternative values for the first column:
chr1 scaffold_A
chr2 scaffold_B
chr3 scaffold_C
chr4 scaffold_D
chr5 scaffold_E
How can I take the values in the reference table to rename values in the first table so the final output is:
scaffold_A 24809154 24809669
scaffold_A 24546969 24547563
scaffold_A 7932037 7932594
scaffold_C 42012155 42012598
scaffold_C 923035 923549
scaffold_D 5799575 5799990
scaffold_D 6895845 6896348
scaffold_D 2337251 2337743
scaffold_E 10715994 10716426
scaffold_E 4385445 4385878
Using awk
$ awk 'NR==FNR {a[$1]=$2;next}{$1=a[$1]}1' reference.table file1
scaffold_A 24809154 24809669
scaffold_A 24546969 24547563
scaffold_A 7932037 7932594
scaffold_C 42012155 42012598
scaffold_C 923035 923549
scaffold_D 5799575 5799990
scaffold_D 6895845 6896348
scaffold_D 2337251 2337743
scaffold_E 10715994 10716426
scaffold_E 4385445 4385878
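One thing to watch with the one-liner above: a name that is missing from the reference table gets replaced by an empty string. A possible variation, sketched here on made-up data (chr9 is a hypothetical unmapped name), only rewrites the first column when a mapping exists:

```shell
work="$(mktemp -d)"
printf 'chr1 scaffold_A\nchr3 scaffold_C\n' > "$work/reference.table"
# chr9 has no entry in the reference table
printf 'chr1 7932037 7932594\nchr9 100 200\nchr3 923035 923549\n' > "$work/file1"

# First file: build the lookup. Second file: swap $1 only if it is a known key.
renamed="$(awk 'NR==FNR { a[$1] = $2; next } ($1 in a) { $1 = a[$1] } 1' \
  "$work/reference.table" "$work/file1")"

printf '%s\n' "$renamed"
rm -rf "$work"
```

The chr9 line passes through unchanged instead of losing its first field.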
I think the easiest way is to write a script that will loop through the table line by line, get the first field (field1 below), then get its substituting value (subs1 below) and finally make the substitution using sed on a copy of the table (renamed.txt):
#!/bin/bash
cp "table.txt" "renamed.txt"
while read -r c1 _; do
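# exact match on the first column, so chr1 does not also match chr10, chr11, ...
subs1=$(awk -v key="$c1" '$1 == key { print $2; exit }' "ref.txt")
# anchor to the start of the line and require whitespace after the name
sed -i "s/^${c1}\([[:space:]]\)/${subs1}\1/" "renamed.txt"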
done < "table.txt"
Testing:
$ cat table.txt
chr1 24809154 24809669
chr1 24546969 24547563
chr1 7932037 7932594
chr3 42012155 42012598
chr3 923035 923549
chr4 5799575 5799990
chr4 6895845 6896348
chr4 2337251 2337743
chr5 10715994 10716426
chr5 4385445 4385878
$ cat ref.txt
chr1 scaffold_A
chr2 scaffold_B
chr3 scaffold_C
chr4 scaffold_D
chr5 scaffold_E
$ ./rename_table.sh
$ cat renamed.txt
scaffold_A 24809154 24809669
scaffold_A 24546969 24547563
scaffold_A 7932037 7932594
scaffold_C 42012155 42012598
scaffold_C 923035 923549
scaffold_D 5799575 5799990
scaffold_D 6895845 6896348
scaffold_D 2337251 2337743
scaffold_E 10715994 10716426
scaffold_E 4385445 4385878
Using join command:
$ join table.txt ref.txt -o '2.2 1.2 1.3'
scaffold_A 24809154 24809669
scaffold_A 24546969 24547563
scaffold_A 7932037 7932594
scaffold_C 42012155 42012598
scaffold_C 923035 923549
scaffold_D 5799575 5799990
scaffold_D 6895845 6896348
scaffold_D 2337251 2337743
scaffold_E 10715994 10716426
scaffold_E 4385445 4385878
Here the -o '2.2 1.2 1.3' option lists the output fields in the form "FILENUM.FIELD": print the second column of the second file (ref.txt), then the second and third columns of the first file (table.txt).
As join requires the common column to be sorted, this could be used for unsorted files:
$ join <(sort table.txt) <(sort ref.txt) -o '2.2 1.2 1.3'
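A small sketch of what the sorted version does, on a two-line subset of the sample data written to temporary files (the file names under the temp directory are just for this example). Note that the rows come out in key order, not in the original order of table.txt:

```shell
work="$(mktemp -d)"
# table is deliberately not sorted on the first column
printf 'chr3 923035 923549\nchr1 7932037 7932594\n' > "$work/table.txt"
printf 'chr1 scaffold_A\nchr3 scaffold_C\n' > "$work/ref.txt"

# join needs both inputs sorted on the join field
sort -k1,1 "$work/table.txt" > "$work/table.sorted"
sort -k1,1 "$work/ref.txt" > "$work/ref.sorted"

# 2.2 = field 2 of the second file, 1.2 and 1.3 = fields 2 and 3 of the first
joined="$(join -o '2.2 1.2 1.3' "$work/table.sorted" "$work/ref.sorted")"

printf '%s\n' "$joined"
rm -rf "$work"
```

If the original row order matters, the awk approach is probably the simpler choice, since it never reorders the table.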
EDIT: Simplified while loop by adding column variable as a parameter to read command, as suggested by #shellter. Also as very well suggested, added a version using join command.
I have two tab-delimited files as follows:
File1:
cg00000292 0.780482425 chr1 10468 10470
cg00002426 0.914482257 chr3 57757816 57757817
cg00003994 0.017355388 chr1 15686237 15686238
cg00005847 0.065539061 chr1 176164345 176164346
cg00006414 0.000000456 chr7 10630 10794
cg00007981 0.018839033 chr11 94129428 94129429
cg00008493 0.982994402 chr3 10524 10524
cg00008713 0.018604172 chr18 11980954 11980955
cg00009407 0.002403351 chr3 88824577 88824578
File2:
chr1 10468 10470 2 100 78 0.780
chr1 10483 10496 4 264 244 0.924
chr3 10524 10524 1 47 44 0.936
chr1 10541 10541 1 64 50 0.781
chr3 10562 10588 5 510 480 0.941
chr1 10608 10619 3 243 231 0.951
chr7 10630 10794 42 5292 5040 0.952
chr1 10810 10815 3 135 102 0.756
I want to merge these two files into a single file: wherever columns 3 and 4 of file1 are both equal to columns 1 and 2 of file2, keep all columns of file2 plus column 2 of file1.
output like this:
chr1 10468 10470 2 100 78 0.780 0.780482425
chr3 10524 10524 1 47 44 0.936 0.982994402
chr7 10630 10794 42 5292 5040 0.952 0.000000456
Thank you so much,
Vahid.
I tried this awk command:
awk 'NR==FNR{a[$3,$4]=$1OFS$2;next}{$6=a[$1,$2];print}' file1.tsv file2.tsv
But it does not give me the output I am looking for; the output is a combination of both files, like this:
chr1 10468 10470 2 100 cg00000292 0.780482425 0.78
chr1 10483 10496 4 264 0.924
chr3 10524 10524 1 47 cg00008493 0.982994402 0.936
chr1 10541 10541 1 64 0.781
chr3 10562 10588 5 510 0.941
chr1 10608 10619 3 243 0.951
chr7 10630 10794 42 5292 cg00006414 0.000000456 0.952
chr1 10810 10815 3 135 0.756
The basic idea here is to read the first file and, using each line's third and fourth columns as a key, save the second column in an array. Then, for each line in the second file, if its first two columns were seen in the first file, print that line followed by the saved second column of the first file.
$ awk 'BEGIN{ FS=OFS="\t" }
NR==FNR { seen[$3,$4]=$2; next }
($1,$2) in seen { print $0, seen[$1,$2] }' file1.tsv file2.tsv
chr1 10468 10470 2 100 78 0.780 0.780482425
chr3 10524 10524 1 47 44 0.936 0.982994402
chr7 10630 10794 42 5292 5040 0.952 0.000000456
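A side note on why the key is written as seen[$3,$4] rather than seen[$3 $4]: the comma form joins the parts with SUBSEP, so two different pairs cannot collapse into the same key. A quick sketch on two made-up rows (not from the question's data) that collide under plain concatenation:

```shell
# "chr1" + "10468" and "chr11" + "0468" both concatenate to "chr110468"
counts="$(printf 'chr1\t10468\nchr11\t0468\n' | awk -F'\t' '
  { sub_keys[$1, $2] = 1      # joined with SUBSEP: two distinct keys
    cat_keys[$1 $2] = 1 }     # plain concatenation: one shared key
  END {
    for (k in sub_keys) ns++
    for (k in cat_keys) nc++
    print ns, nc
  }')"

printf '%s\n' "$counts"
```

This prints 2 1: two keys with the comma form, one with concatenation.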
# I want to merge these two files in a unique file
# if both values in columns 3 and 4 of file1
# are equal to columns 1 and 2 of file2
# and to keep all columns of file2 plus column 2 of file1.
# the ":" keeps chr1 + 10468 distinct from chr11 + 0468 in the combined key
join -t$'\t' -11 -21 -o2.2,2.3,2.4,2.5,2.6,2.7,2.8,1.3 <(
<file1 awk -vFS=$'\t' -vOFS=$'\t' '{ print $3 ":" $4,$0 }' |
sort -t$'\t' -k1,1
) <(
<file2 awk -vFS=$'\t' -vOFS=$'\t' '{ print $1 ":" $2,$0 }' |
sort -t$'\t' -k1,1
)
First, preprocess both files to build the key you want to join on and put it in the first column.
Then sort both on that key and join them.
Finally, specify the output format for join.
Tested on repl against:
# recreate input files
tr -s ' ' <<EOF | tr ' ' '\t' >file1
cg00000292 0.780482425 chr1 10468 10470
cg00002426 0.914482257 chr3 57757816 57757817
cg00003994 0.017355388 chr1 15686237 15686238
cg00005847 0.065539061 chr1 176164345 176164346
cg00006414 0.000000456 chr7 10630 10794
cg00007981 0.018839033 chr11 94129428 94129429
cg00008493 0.982994402 chr3 10524 10524
cg00008713 0.018604172 chr18 11980954 11980955
cg00009407 0.002403351 chr3 88824577 88824578
EOF
tr -s ' ' <<EOF | tr ' ' '\t' >file2
chr1 10468 10470 2 100 78 0.780
chr1 10483 10496 4 264 244 0.924
chr3 10524 10524 1 47 44 0.936
chr1 10541 10541 1 64 50 0.781
chr3 10562 10588 5 510 480 0.941
chr1 10608 10619 3 243 231 0.951
chr7 10630 10794 42 5292 5040 0.952
chr1 10810 10815 3 135 102 0.756
EOF
In the awk below I am trying to combine all lines with a matching $4 into a single line, keeping $5 only up to the -, and average all values in $7. Why is awk complaining that the output is not found (that is, /home/cmccabe/Desktop/NGS/API/2-12-2015/bedtools/30x/${pref}_genes.txt)? Thank you :).
input (/home/cmccabe/Desktop/NGS/API/2-12-2015/bedtools/30x/*30reads_perbase.txt)
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 1 15
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 2 16
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 3 16
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 4 14
chr1 976035 976270 chr1:976035-976270 AGRN-9|gc=74.5 1 28
chr1 976035 976270 chr1:976035-976270 AGRN-9|gc=74.5 2 27
chr1 976035 976270 chr1:976035-976270 AGRN-9|gc=74.5 3 27
desired output
chr1:955543-955763 4 AGRN 15
chr1:976035-976270 3 AGRN 27
awk
for f in /home/cmccabe/Desktop/NGS/API/2-12-2015/30x/*30reads_perbase.txt ; do bname=`basename "$f"`; pref=${bname%%.txt}; awk '{k=$4 FS $5; a[k]+=$7; c[k]++}
END{for(k in a)
split(k,ks,FS);
print ks[1],c[k],ks[2],a[k]/c[k]}' "$f" > /home/cmccabe/Desktop/NGS/API/2-12-2015/30x/"${pref}"_genes.txt; done
current output
chr1:976035-976270 3 AGRN 27.3333
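Use the functions substr and match to trim the gene name when printing, and put braces around the body of the for loop. Without the braces only split is inside the loop and print runs once after it ends, which is why just one line comes out:
awk '{k=$4 FS $5; a[k]+=$7; c[k]++}END{for(k in a){split(k,ks,FS);print ks[1],c[k],substr(ks[2],1,match(ks[2],"-")-1),a[k]/c[k]}}' "$f"
chr1:955543-955763 4 AGRN 15.25
chr1:976035-976270 3 AGRN 27.3333
(The row order is not guaranteed, because for (k in a) visits the keys in an unspecified order.)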
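The desired output in the question shows whole-number averages and rows in input order. A possible sketch for that, assuming truncation with int() is what is wanted (rounding would need printf instead) and using a temporary file in place of the real input:

```shell
work="$(mktemp -d)"
cat > "$work/perbase.txt" <<'EOF'
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 1 15
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 2 16
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 3 16
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 4 14
chr1 976035 976270 chr1:976035-976270 AGRN-9|gc=74.5 1 28
chr1 976035 976270 chr1:976035-976270 AGRN-9|gc=74.5 2 27
chr1 976035 976270 chr1:976035-976270 AGRN-9|gc=74.5 3 27
EOF

genes="$(awk '
  {
    k = $4
    if (!(k in n)) {          # first time this interval is seen
      order[++m] = k          # remember first-seen order
      split($5, g, "-")       # "AGRN-6|gc=75" -> g[1] is "AGRN"
      gene[k] = g[1]
    }
    sum[k] += $7
    n[k]++
  }
  END {
    # walk the keys in first-seen order instead of for (k in ...)
    for (i = 1; i <= m; i++) {
      k = order[i]
      print k, n[k], gene[k], int(sum[k] / n[k])
    }
  }' "$work/perbase.txt")"

printf '%s\n' "$genes"
rm -rf "$work"
```

Keeping a separate order array is what makes the output order predictable.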
I'm trying to filter one file based on two columns of another one.
The problem is that awk is not differentiating, for example, this interval 70083 83083, from position 7323573 (please see below).
The aim is to retrieve the value for file 1 that is in the column 5 of file 2.
File 1 has only one position in the column 3 ex: 51476, and the file 2 has an interval represented by column 3 and 4.
In the end I need the file 1 with respective values of the column 5 (see output).
file 1
rs187298206 chr1 51476 0.0072 0.201426626822702
rs116400033 chr1 51479 0.2055 1.18445621536109
rs62637813 chr1 52058 0.0587 0.551216300225955
rs190291950 chr1 52144 -4e-04 0.036575951491895
rs150021059 chr1 52238 0.3325 1.70427928591544
rs140052487 chr1 54353 0.003 0.12778378962414
rs146477069 chr1 54421 0.1419 0.924336309646664
rs141149254 chr1 54490 0.1767 1.06786868821145
rs2462492 chr1 54676 0.0819 0.664355314594874
rs143174675 chr1 54753 0.026 0.356836206987615
rs3091274 chr1 55164 0.3548 1.80091078751368
rs10399749 chr1 55299 0.0309 0.389748348495465
rs182462964 chr1 55313 2e-04 0.0877969207975495
rs3107975 chr1 55326 0.0237 0.344080010917931
rs142800240 chr1 7323573 -6e-04 0.0361473609720785
file 2
51083_1 chr1 51083 56000 -0.177152387075888 0.172569306719619
57083_1 chr1 57083 60083 -0.0524335467819781 0.130497858911419
60083_1 chr1 70083 83083 -0.0332555672564894 0.124932838766226
525083_1 chr1 525083 528083 0.291406335374442 0.0577249392691202
528083_1 chr1 528083 531083 0.291406335374442 0.0577249392691202
531083_1 chr1 531083 534083 0.291406335374442 0.0577249392691202
534083_1 chr1 534083 537083 0.291406335374442 0.0577249392691202
534083_1 chr1 534083 537083 0.441406335374442 0.0577249392691202
What I get with this script:
awk '
NR == FNR {score[$3] = $1 FS $2 FS $3 FS $4; next}
{
for (key in score)
if (key > $3 && key < $4)
print score[key], $5
}
' file1 file2 > output
output
rs140052487 chr1 54353 0.003 -0.177152387075888
rs150021059 chr1 52238 0.3325 -0.177152387075888
rs3107975 chr1 55326 0.0237 -0.177152387075888
rs3091274 chr1 55164 0.3548 -0.177152387075888
rs187298206 chr1 51476 0.0072 -0.177152387075888
rs116400033 chr1 51479 0.2055 -0.177152387075888
rs10399749 chr1 55299 0.0309 -0.177152387075888
rs146477069 chr1 54421 0.1419 -0.177152387075888
rs190291950 chr1 52144 -4e-04 -0.177152387075888
rs182462964 chr1 55313 2e-04 -0.177152387075888
rs141149254 chr1 54490 0.1767 -0.177152387075888
rs62637813 chr1 52058 0.0587 -0.177152387075888
rs143174675 chr1 54753 0.026 -0.177152387075888
rs2462492 chr1 54676 0.0819 -0.177152387075888
rs142800240 chr1 7323573 -6e-04 -0.0332555672564894 <- this should not appear
awk '
NR == FNR {score[$3] = $1 FS $2 FS $3 FS $4; next}
{
for (key in score)
if (key+0 > $3 && key+0 < $4)
print score[key], $5
}
' fst.txt tajima.txt > output
gives me
[/tmp]$ cat output
rs182462964 chr1 55313 2e-04 -0.177152387075888
rs190291950 chr1 52144 -4e-04 -0.177152387075888
rs62637813 chr1 52058 0.0587 -0.177152387075888
rs146477069 chr1 54421 0.1419 -0.177152387075888
rs140052487 chr1 54353 0.003 -0.177152387075888
rs3107975 chr1 55326 0.0237 -0.177152387075888
rs187298206 chr1 51476 0.0072 -0.177152387075888
rs141149254 chr1 54490 0.1767 -0.177152387075888
rs10399749 chr1 55299 0.0309 -0.177152387075888
rs3091274 chr1 55164 0.3548 -0.177152387075888
rs143174675 chr1 54753 0.026 -0.177152387075888
rs2462492 chr1 54676 0.0819 -0.177152387075888
rs150021059 chr1 52238 0.3325 -0.177152387075888
rs116400033 chr1 51479 0.2055 -0.177152387075888
To force the interpretation as a number, add 0 to it (from the man page for awk).
I can reproduce your problem on Mac OS X 10.11.3 with the system's BSD awk.
The problem is to do with string vs number comparison. Array subscripts in awk are always strings, so the key that comes out of for (key in score) is a string, and awk does a string comparison rather than a numerical comparison.
I've brute-forced it into treating the comparison numerically with:
awk '
NR == FNR {score[$3] = $1 FS $2 FS $3 FS $4; next}
{
for (key in score)
{
if (key+0 > $3+0 && key+0 < $4+0)
{
#print "==", key, $3, $4
#if (key > $3) print key, ">", $3
#if (key < $4) print key, "<", $4
print score[key], $5
}
}
}
' file1 file2
You can see the '+0' to force awk to treat things as numbers. (The analogue to force awk to treat a value as a string is, for example, key "", which concatenates an empty string to the (string) value of key.)
With your sample data, I then get the output:
rs140052487 chr1 54353 0.003 -0.177152387075888
rs150021059 chr1 52238 0.3325 -0.177152387075888
rs3107975 chr1 55326 0.0237 -0.177152387075888
rs3091274 chr1 55164 0.3548 -0.177152387075888
rs187298206 chr1 51476 0.0072 -0.177152387075888
rs116400033 chr1 51479 0.2055 -0.177152387075888
rs10399749 chr1 55299 0.0309 -0.177152387075888
rs146477069 chr1 54421 0.1419 -0.177152387075888
rs190291950 chr1 52144 -4e-04 -0.177152387075888
rs182462964 chr1 55313 2e-04 -0.177152387075888
rs141149254 chr1 54490 0.1767 -0.177152387075888
rs62637813 chr1 52058 0.0587 -0.177152387075888
rs143174675 chr1 54753 0.026 -0.177152387075888
rs2462492 chr1 54676 0.0819 -0.177152387075888
Part of the debugging output, which gave the game away, was:
== 54676 51083 56000
54676 > 51083
54676 < 56000
rs2462492 chr1 54676 0.0819 -0.177152387075888
== 7323573 70083 83083
7323573 > 70083
7323573 < 83083
rs142800240 chr1 7323573 -6e-04 -0.0332555672564894
For the 5-digit strings, the string comparison happened to give the same result as a numeric comparison. For the 7-digit one, it did not. I should also point out that the $3+0 and $4+0 parts are probably not essential. I had those in place when I got the debugging output shown, but the tests only started to work when I added 0 to the key, so I probably don't need to add 0 to $3 or $4.
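A minimal sketch of the difference, using the offending position and interval from the sample data. The variable names lo and hi are just for this example:

```shell
cmp="$(awk 'BEGIN {
  score["7323573"] = 1        # subscripts are stored as strings
  lo = 70083; hi = 83083
  for (key in score) {
    as_string = (key > lo && key < hi)       # compared character by character
    as_number = (key+0 > lo && key+0 < hi)   # +0 forces a numeric comparison
    print as_string, as_number
  }
}')"

printf '%s\n' "$cmp"
```

This prints 1 0: as strings, "7323573" sorts between "70083" and "83083", while as a number it is far outside the interval.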
For example, I have this chromosome file:
Chr1 0 145 Region1
Chr1 450 500 Region2
Chr1 499 549 Region2
...
I'd like to remove the third line because Region2 appeared on line 2. I would greatly appreciate any suggestion. Thank you!
Assuming you've got a tab delimiter, this should work using awk:
awk -F'\t' '!x[$4]++' file.txt
If it's not tab, just change '\t' to whatever the delimiter is; by default awk splits on whitespace.
Here's an example showing the results:
input:
~$ cat file.txt
Chr1 0 145 Region1
Chr1 450 500 Region2
Chr1 499 549 Region2
awk:
awk -F'\t' '!x[$4]++' file.txt
Chr1 0 145 Region1
Chr1 450 500 Region2
This works by printing when an element is added to the array that has not been encountered before. It's a pretty standard deduping one-liner just modified to care about a specific field and not the whole line.
It works by using the 4th field as a key in an associative array and post-incrementing its value, so the expression returns 0 the first time a key is seen and a larger number for each subsequent duplicate. Adding the ! reverses this logic: we print if the post-increment returned 0, and not if it returned anything else, which it will for every later duplicate.
For example, adding a few more lines to the file:
~$ cat file.txt
Chr1 0 145 Region1
Chr1 450 500 Region2
Chr1 499 549 Region2
Chr1 499 555 Region2
Chr1 499 555 Region3
Chr1 499 556 Region3
And then changing our print to show the output we're testing:
~$ awk -F'\t' '{print x[$4]++}' file.txt
0
0
1
2
0
1
It should be much more obvious what is happening here.
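If the one-liner still feels too terse, the same idea can be spelled out with an explicit membership test. This is just an equivalent sketch, with the array renamed to seen and a shortened version of the sample data fed in through printf:

```shell
deduped="$(printf 'Chr1\t0\t145\tRegion1\nChr1\t450\t500\tRegion2\nChr1\t499\t549\tRegion2\nChr1\t499\t555\tRegion3\n' |
  awk -F'\t' '!($4 in seen) { seen[$4] = 1; print }')"

printf '%s\n' "$deduped"
```

Only the first line for each region is printed, so the second Region2 line is dropped.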