I am dealing with the analysis of multi-column data organized in the following manner:
#Acceptor DonorH Donor Frames Frac AvgDist AvgAng
lig_608#O1 GLU_166#H GLU_166#N 13731 0.6865 2.8609 160.4598
lig_608#O2 HIE_163#HE2 HIE_163#NE2 8320 0.4160 2.8412 150.3665
lig_608#N2 ASN_142#HD21 ASN_142#ND2 1575 0.0788 2.9141 157.3493
lig_608#N2 THR_25#HG1 THR_25#OG1 218 0.0109 2.8567 156.0376
lig_608#O1 GLN_189#HE22 GLN_189#NE2 72 0.0036 2.8427 157.3778
lig_608#N2 ASN_142#HD22 ASN_142#ND2 43 0.0022 2.9093 165.3063
lig_608#N2 SER_46#HG SER_46#OG 32 0.0016 2.8710 159.8673
lig_608#F1 HIE_41#HE2 HIE_41#NE2 31 0.0015 2.8904 153.0763
lig_608#O2 SER_144#HG SER_144#OG 20 0.0010 2.8147 144.6951
lig_608#N2 THR_24#HG1 THR_24#OG1 16 0.0008 2.8590 165.3937
lig_608#O2 GLY_143#H GLY_143#N 15 0.0008 2.8729 149.1930
lig_608#F1 GLN_189#HE22 GLN_189#NE2 15 0.0008 2.9192 146.2273
lig_608#O2 SER_144#H SER_144#N 10 0.0005 2.9259 148.8008
lig_608#N2 THR_26#H THR_26#N 8 0.0004 2.9491 149.1861
lig_608#O2 GLU_166#H GLU_166#N 4 0.0002 2.8839 150.1238
lig_608#N2 GLN_189#HE21 GLN_189#NE2 3 0.0001 2.9567 153.7993
lig_608#N2 ASN_119#HD21 ASN_119#ND2 2 0.0001 2.8564 147.7916
lig_608#O2 CYS_145#H CYS_145#N 2 0.0001 2.8867 151.6423
lig_608#O1 GLN_189#HE21 GLN_189#NE2 2 0.0001 2.8888 148.3678
lig_608#N2 GLY_143#H GLY_143#N 2 0.0001 2.9658 149.2518
lig_608#F1 GLN_189#HE21 GLN_189#NE2 1 0.0001 2.8675 139.9754
lig_608#F1 GLN_189#H GLN_189#N 1 0.0001 2.8987 168.1758
lig_608#N2 HIE_41#HE2 HIE_41#NE2 1 0.0001 2.9411 147.0443
From this I need to take the info from the third column (Donor) together with the fifth column (Frac) and print a 2D histogram of the data, considering only the values (of the fifth column) bigger than 0.01. So in the demonstrated example, only the following data should be considered:
#Donor #Frac
GLU_166#N 0.6865
HIE_163#NE2 0.4160
ASN_142#ND2 0.0788
THR_25#OG1 0.0109
and the 2D histogram should plot #Donor on X and #Frac on Y (in %).
Previously I had to add the following lines to the reduced 2-column datafile so that gracebat would recognize it as a 2D bar plot:
# title "No title"
# xaxis label "Donor"
# yaxis label "Frac"
#s0 line type 0
#TYPE bar
# here is the data in 2 column format
Is it possible to automate this file post-processing to produce the bar plot on the fly? Alternatively, I would be grateful for a sed solution that edits the datafile on the fly to reduce it to 2 columns and inserts at the beginning the # lines required for bar-graph plotting, using:
sed -i 's/old-text/new-text/g' datafile
sed isn't meant for this kind of task; you should use awk:
awk '
BEGIN {
print "# title \"No title\""
print "# xaxis label \"Donor\""
print "# yaxis label \"Frac\""
print "#s0 line type 0"
print "#TYPE bar"
}
NR > 1 && $5 > 0.01 { print $3, $5 }
' file.txt
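To get from there to the plot in one step, a minimal sketch: write the filtered set to a scratch file (hbonds.dat is an arbitrary name) and feed it to gracebat the way you already do; printing $5 * 100 gives Frac in percent, as your plot requires:
awk '
BEGIN {
print "# title \"No title\""
print "# xaxis label \"Donor\""
print "# yaxis label \"Frac (%)\""
print "#s0 line type 0"
print "#TYPE bar"
}
NR > 1 && $5 > 0.01 { print $3, $5 * 100 }
' file.txt > hbonds.dat
gracebat hbonds.dat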
You could also do this with an on-the-fly generated Gnuplot script, e.g.:
cat <<EOS | gnuplot > output.png
set term pngcairo size 1280,960
set xtics noenhanced
set xlabel "Donor"
set ylabel "Frac"
set key off
set style fill solid 0.5
set boxwidth 0.9
plot "<awk 'NR == 1 || \$5 > 0.01' infile.tsv" using 0:5:xtic(3) with boxes
EOS
This results in a PNG file with the bar chart.
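In the plot command, using 0:5:xtic(3) takes gnuplot's pseudo-column 0 (the record index) as x, column 5 (Frac) as y, and labels each x tick with the string in column 3 (Donor); the inline awk keeps only rows with Frac > 0.01 (the header line it passes through starts with # and is ignored by gnuplot as a comment).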
I am looking for a way to replace (with 0) a specific value (1043252782) in a "table-like" string stored as a bash variable. The output of echo "$var" looks like this:
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 090 060 045 Pre-fail Always - 1043252782
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
After the replacement echo "$var" should look like this:
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 090 060 045 Pre-fail Always - 0
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
Is there a way to do this without saving the content of $var to a file, i.e. directly manipulating it within bash (in a shell script)?
Maybe with awk? I can select the value in the 10th field of the second record with awk and pattern matching ("7 Seek_Error_Rate ....") like this:
echo "$var" | awk '/^ 7/{print $10}'
Maybe there is some way of doing it with awk (or another CLI tool) to replace it and store the result back into $var? Also, the value changes over time, but the structure remains the same (some record with the value in the 10th field).
You can change a specific string directly in the shell:
var=${var/1043252782/0}
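Note that ${var/old/new} replaces only the first occurrence; ${var//old/new} replaces all of them. A quick illustration with a throwaway variable:
$ v='x 1043252782 y 1043252782'
$ echo "${v/1043252782/0}"
x 0 y 1043252782
$ echo "${v//1043252782/0}"
x 0 y 0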
To replace the final number of the second line, you could use awk or sed:
var=$(awk 'NR==2 { sub(/[0-9]+$/,0) }1' <<<"$var")
var=$(sed '2s/[0-9][0-9]*$/0/' <<<"$var")
If you don't know which line it will be, you can match a known string:
var=$(awk '/Seek_Error_Rate/{ sub(/[0-9]+$/,0) }1' <<<"$var")
var=$(sed '/Seek_Error_Rate/s/[0-9][0-9]*$/0/' <<<"$var")
You can use a here-string to feed the variable as input to awk.
Use sub() to perform a regular expression replacement.
var=$(awk '{sub(/1043252782$/, "0")}1' <<<"$var")
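The trailing $ anchors the match to the end of a record, so a 1043252782 that happened to appear mid-line (e.g. in another column) would be left alone.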
Using sed
$ var=$(sed '/1043252782$/s//0/' <<< "$var")
$ echo "$var"
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 090 060 045 Pre-fail Always - 0
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
If you don't want to ruin the formatting of tabs and spaces (works in both mawk and gawk):
awk NF=NF FS=' 1043252782$' OFS=' 0'
Output:
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 090 060 045 Pre-fail Always - 0
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
Or doing the whole file in one single shot:
awk NF=NF FS=' 1043252782\n' OFS=' 0\n' RS='^$' ORS=
awk NF=NF FS=' 1043252782\n' OFS=' 0\n' RS=
(the second form, with a blank RS, might work too, but I'm not well versed in any side effects of a blank RS)
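For what it's worth, the idea behind these FS/OFS variants: FS is treated as a regex, so each record is split around the text to be removed, and the assignment NF=NF forces awk to rebuild the record with OFS as the glue, which effectively substitutes the replacement text while leaving the whitespace inside the remaining fields untouched.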
I have two files. If the column "chromosome" matches between the two files and the position of File1 is between the Start_position and the End_position of File2, I would like to associate the two cell_frac values. If the Gene (chromosome + position) is not present in File2, I would like both cell_frac values to be equal to 0.
File1:
Hugo_Symbol Chromosome Position
Gene1 1 111111
Gene2 1 222222
Gene3 2 333333
Gene4 2 333337
File2:
Chromosome Start_Position End_Position cell_frac_A1 cell_frac_A2
1 222220 222230 0.12 0.01
2 333330 333340 0.03 0.25
3 444440 444450 0.01 0.01
Desired output:
Hugo_Symbol Chromosome Position cell_frac_A1 cell_frac_A2
Gene1 1 111111 0 0
Gene2 1 222222 0.12 0.01
Gene3 2 333333 0.03 0.25
Gene4 2 333337 0.03 0.25
Edit: Here is the beginning of the code I used for now (not correct output):
awk '
NR==FNR{ range[$1,$2,$3,$4,$5]; next }
FNR==1
{
for(x in range) {
split(x, check, SUBSEP);
if($2==check[1] && $3>=check[2] && $3<=check[3]) { print $1"\t"$2"\t"$3"\t"check[4]"\t"check[5]}
}
}
' File2 File1
However, I did not manage to associate a 0 (with "else") when the gene was not present, and I get the wrong number of lines. Can you give me more hints?
Thanks a lot.
One awk-only idea ...
NOTE: see my other answer for assumptions/understandings and my version of file1
awk ' # process file2
FNR==NR { c=$1 # save chromosome value
$1="" # clear field #1
file2[c]=$0 # use chromosome as array index; save line in array
next
}
# process file1
{ startpos=endpos=-9999999999 # default values in case
a1=a2="" # no match in file2
if ($2 in file2) { # if chromosome also in file2
split(file2[$2],arr) # split file2 data into array arr[]
startpos =arr[1]
endpos =arr[2]
a1 =arr[3]
a2 =arr[4]
}
# if not the header row and file1/position outside of file2/range then set a1=a2=0
if (FNR>1 && ($3 < startpos || $3 > endpos)) a1=a2=0
print $0,a1,a2
}
' file2 file1
This generates:
Hugo_Symbol Chromosome Position cell_frac_A1 cell_frac_A2
Gene2 1 222222 0.12 0.01
Gene3 2 333333 0.03 0.25
Gene1 1 111111 0 0
Gene4 2 333337 0.03 0.25
Gene5 4 444567 0 0
Changing the last line to ' file2 file1 | column -t generates:
Hugo_Symbol Chromosome Position cell_frac_A1 cell_frac_A2
Gene2 1 222222 0.12 0.01
Gene3 2 333333 0.03 0.25
Gene1 1 111111 0 0
Gene4 2 333337 0.03 0.25
Gene5 4 444567 0 0
Presorting file1 by Chromosome and Position by changing last line to ' file2 <(head -1 file1; tail -n +2 file1 | sort -k2,3) | column -t generates:
Hugo_Symbol Chromosome Position cell_frac_A1 cell_frac_A2
Gene1 1 111111 0 0
Gene2 1 222222 0.12 0.01
Gene3 2 333333 0.03 0.25
Gene4 2 333337 0.03 0.25
Gene5 4 444567 0 0
One big issue (same as with my other answer): the actual code may become unwieldy when dealing with 519 total columns, especially if there's a need to intersperse a lot of columns; otherwise OP may be able to use some for loops to more easily print ranges of columns.
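Note also that file2[c]=$0 keeps only the last file2 line seen for each chromosome, so this particular sketch assumes file2 contains at most one range per chromosome (true for the sample data).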
A job for sql instead of awk, perhaps?
tr -s ' ' '|' <File1 >file1.csv
tr -s ' ' '|' <File2 >file2.csv
(
echo 'Hugo_Symbol|Chromosome|Position|cell_frac_A1|cell_frac_A2'
sqlite3 <<'EOD'
.import file1.csv t1
.import file2.csv t2
select distinct
t1.hugo_symbol,
t1.chromosome,
t1.position,
case
when t1.position between t2.start_position and t2.end_position
then t2.cell_frac_a1
else 0
end,
case
when t1.position between t2.start_position and t2.end_position
then t2.cell_frac_a2
else 0
end
from t1 join t2 on t1.chromosome=t2.chromosome;
EOD
rm file[12].csv
) | tr '|' '\t'
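A few notes on this sketch: invoked without a database argument, sqlite3 works in a transient in-memory database; .import into a table that does not yet exist creates it, taking column names from the first row; and the default list-mode separator is |, which is why tr converts the files first. Be aware that the inner join drops any file1 row whose chromosome never appears in file2, rather than filling its fracs with 0.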
Assumptions and/or understandings based on sample data and OP comments ...
file1 is not sorted by chromosome
file2 is sorted by chromosome
common headers in both files are spelled the same (a mismatch such as file1:Chromosome vs file2:Chromosom would keep the header lines from joining)
if a chromosome exists in file1 but does not exist in file2 then we keep the line from file1 and the columns from file2 are left blank
both files are relatively small (file1: 5MB, 900 lines; file2: few KB, 50 lines)
NOTE: the number of columns (file1: 500 columns; file2: 20 columns) could be problematic from the point of view of cumbersome coding ... more on that later ...
Sample inputs:
$ cat file1 # scrambled chromosome order; added chromosome=4 line
Hugo_Symbol Chromosome Position
Gene1 1 111111
Gene3 2 333333
Gene2 1 222222
Gene4 2 333337
Gene5 4 444567 # has no match in file2
$ cat file2
Chromosome Start_Position End_Position cell_frac_A1 cell_frac_A2
1 222220 222230 0.12 0.01
2 333330 333340 0.03 0.25
3 444440 444450 0.01 0.01
First issue is to sort file1 by Chromosome and Position and also keep the header line in place:
$ (head -1 file1; tail -n +2 file1 | sort -k2,3)
Hugo_Symbol Chromosome Position
Gene1 1 111111
Gene2 1 222222
Gene3 2 333333
Gene4 2 333337
Gene5 4 444567
We can now join the 2 files based on the Chromosome column:
$ join -1 2 -2 1 -a 1 --nocheck-order <(head -1 file1; tail -n +2 file1 | sort -k2,3) file2
Chromosome Hugo_Symbol Position Start_Position End_Position cell_frac_A1 cell_frac_A2
1 Gene1 111111 222220 222230 0.12 0.01
1 Gene2 222222 222220 222230 0.12 0.01
2 Gene3 333333 333330 333340 0.03 0.25
2 Gene4 333337 333330 333340 0.03 0.25
4 Gene5 444567
Where:
-1 2 -2 1 - join on Chromosome columns: -1 2 == file #1 column #2; -2 1 == file #2 column #1
-a 1 - keep columns from file #1 (sorted file1)
--nocheck-order - disable verifying input is sorted by join column; optional; may be needed if a locale thinks 1 should be sorted before Chromosome
NOTE: for the sample inputs/outputs we don't need a special output format so we can skip the -o option, but this may be needed depending on OP's output requirements for 519 total columns (though it may also become unwieldy)
From here OP can use bash or awk to do comparisons (is column #3 between columns #4/#5); one awk idea:
$ join -1 2 -2 1 -a 1 --nocheck-order <(head -1 file1; tail -n +2 file1 | sort -k2,3) file2 | awk 'FNR>1{if ($3<$4 || $3>$5) $6=$7=0} {print $2,$1,$3,$6,$7}'
Hugo_Symbol Chromosome Position cell_frac_A1 cell_frac_A2
Gene1 1 111111 0 0 # Position outside of range
Gene2 1 222222 0.12 0.01
Gene3 2 333333 0.03 0.25
Gene4 2 333337 0.03 0.25
Gene5 4 444567 0 0 # no match in file2; if there were other columns from file2 they would be empty
And to match OP's sample output (appears to be a fixed width requirement) we can pass this to column:
$ join -1 2 -2 1 -a 1 --nocheck-order <(head -1 file1; tail -n +2 file1 | sort -k2,3) file2 | awk 'FNR>1{if ($3<$4 || $3>$5) $6=$7=0} {print $2,$1,$3,$6,$7}' | column -t
Hugo_Symbol Chromosome Position cell_frac_A1 cell_frac_A2
Gene1 1 111111 0 0
Gene2 1 222222 0.12 0.01
Gene3 2 333333 0.03 0.25
Gene4 2 333337 0.03 0.25
Gene5 4 444567 0 0
NOTE: Keep in mind this may be untenable with OP's 519 total columns, especially if interspersed columns contain blanks/white-space (ie, column -t may not parse the input properly)
Issues (in addition to incorrect assumptions and previous NOTES):
for relatively small files the performance of the join | awk | column should be sufficient
for larger files all of this code can be rolled into a single awk solution though memory usage could be an issue on a small machine (eg, one awk idea would be to load file2 into memory via arrays so memory would need to be large enough to hold all of file2 ... probably not an issue unless file2 gets to be 100's/1000's of MBytes in size)
for 519 total columns the awk/print will get unwieldy, especially if there's a need to move/intersperse a lot of columns
I have a huge file (hundreds of lines, ca. 4,000 columns) structured like this
locus 1 1 1 2 2 3 3 3
exon 1 2 3 1 2 1 2 3
data1 17.07 7.11 10.58 10.21 19.34 14.69 3.32 21.07
data2 21.42 11.46 7.88 9.89 27.24 12.40 0.58 19.82
and I need to calculate the mean of all values (on each data line separately) that share the same locus number (i.e., the same number in the first line), i.e.
data1: mean from first three values (three columns with locus '1':
17.07, 7.11, 10.58), next two values (10.21, 19.34) and next three values (14.69, 3.32, 21.07)
I would like to have output like this
data1 mean1 mean2 mean3
data2 mean1 mean2 mean3
I was thinking about using bash and awk...
Thank you for your advice.
You can use GNU datamash version 1.1.0 or newer (I used the latest version, 1.1.1):
#!/bin/bash
lines=$(wc -l < "$1")                    # line count == column count after transposing
datamash -W transpose < "$1" |           # whitespace-delimited transpose: rows become columns
datamash -H groupby 1 mean 3-"$lines" |  # -H: header row; group by locus, mean the data columns
datamash transpose                       # flip back to the original orientation
Usage: mean_value.sh input.txt | column -t (column -t gives a pretty view but is not required).
Output:
GroupBy(locus) 1 2 3
mean(data1) 11.586666666667 14.775 13.026666666667
mean(data2) 13.586666666667 18.565 10.933333333333
If it were me, I would use R, not awk:
library(data.table)
x = fread('data.txt')
#> x
# V1 V2 V3 V4 V5 V6 V7 V8 V9
#1: locus 1.00 1.00 1.00 2.00 2.00 3.00 3.00 3.00
#2: exon 1.00 2.00 3.00 1.00 2.00 1.00 2.00 3.00
#3: data1 17.07 7.11 10.58 10.21 19.34 14.69 3.32 21.07
#4: data2 21.42 11.46 7.88 9.89 27.24 12.40 0.58 19.82
# save first column of names for later
cnames = x$V1
# remove first column
x[,V1:=NULL]
# matrix transpose: makes rows into columns
x = t(x)
# convert back from matrix to data.table
x = data.table(x,keep.rownames=F)
# set the column names
colnames(x) = cnames
#> x
# locus exon data1 data2
#1: 1 1 17.07 21.42
#...
# ditch useless column
x[,exon:=NULL]
#> x
# locus data1 data2
#1: 1 17.07 21.42
# apply mean() function to each column, grouped by locus
x[,lapply(.SD,mean),locus]
# locus data1 data2
#1: 1 11.58667 13.58667
#2: 2 14.77500 18.56500
#3: 3 13.02667 10.93333
for convenience, here's the whole thing again without comments:
library(data.table)
x = fread('data.txt')
cnames = x$V1
x[,V1:=NULL]
x = t(x)
x = data.table(x,keep.rownames=F)
colnames(x) = cnames
x[,exon:=NULL]
x[,lapply(.SD,mean),locus]
awk 'NR==1 { for (i=2; i<=NF; i++) multi[i]=$i }   # map each column to its locus number
NR>2 {
    # reset the per-line accumulators
    for (i in multi) {
        data[multi[i]]  = 0
        count[multi[i]] = 0
    }
    # sum the values and count the columns for each locus
    for (i=2; i<=NF; i++) {
        data[multi[i]]  += $i
        count[multi[i]] += 1
    }
    printf "%s ", $1
    for (i in data)
        printf "%s ", data[i]/count[i]
    print ""
}' <file_name>
Replace <file_name> with the name of your data file.
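One caveat: for (i in data) walks the array in an unspecified order, so the means may not come out in locus order. With GNU awk you can pin the order by adding this at the top of the script:
BEGIN { PROCINFO["sorted_in"] = "@ind_num_asc" }   # gawk-only: iterate indices in numeric order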
I have two files of the form
file1:
#fileheader1
0 123
1 456
2 789
3 999
4 112
5 131
6 415
etc.
file2:
#fileheader2
0 442
1 232
2 542
3 559
4 888
5 231
6 322
etc.
How can I take the second column of each, divide each by a value, then subtract one from the other, and output a new third file with the new values?
I want the output file to have the form
#outputheader
0 123/c-422/k
1 456/c-232/k
2 789/c-542/k
etc.
where c and k are numbers I can plug into the script
I have seen this question: subtract columns from different files with awk
But I don't know how to use awk to do this by myself. Does anyone know how to do this, or could someone explain what is going on in the linked question so I can try to modify it?
I'd write:
awk -v c=10 -v k=20 ' ;# pass values to awk variables
/^#/ {next} ;# skip headers
FNR==NR {val[$1]=$2; next} ;# store values from file1
$1 in val {print $1, (val[$1]/c - $2/k)} ;# perform the calc and print
' file1 file2
output
0 -9.8
1 34
2 51.8
3 71.95
4 -33.2
5 1.55
6 25.4
etc. 0
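For reference, FNR==NR is the standard two-file awk idiom: NR counts records across all inputs while FNR resets for each file, so the pattern is true only while the first file (file1) is being read, which is what lets its values be stored in val[] before file2 is processed.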
I have a file with the following content, e.g. 2 images with FRAG lines giving fragment sizes. I want to calculate the fragments' total size in a bash script.
IMAGE admindb1 8 admindb1_1514997916 bus 4 Default-Application-Backup 2 3 1 1517676316 0 0
FRAG 1 1 10784 0 2 6 2 HSBRQ2 fuj 65536 329579 1514995208 60 0 *NULL* 1517676316 0 3 1 *NULL*
IMAGE admindb1 8 admindb1_1514995211 bus 4 Default-Application-Backup 2 3 1 1517673611 0 0
FRAG 1 1 13168256 0 2 6 12 HSBQ8I fuj 65536 173783 1514316708 52 0 *NULL* 1517673611 0 3 1 *NULL*
FRAG 1 2 24288384 0 2 6 1 HSBRJ7 fuj 65536 2 1514995211 65 0 *NULL* 0 0 3 1 *NULL*
FRAG 1 3 24288384 0 2 6 1 HSBRON fuj 65536 2 1514995211 71 0 *NULL* 0 0 3 1 *NULL*
FRAG 1 4 13806752 0 2 6 1 HSBRRK fuj 65536 2 1514995211 49 0 *NULL* 0 0 3 1 *NULL*
Output should be like this:
For Image admindb1_1514997916 total size is 10784
For Image admindb1_1514995211 total size is 75551776
The 4th column of each line beginning with FRAG should be summed.
My script is not working:
#!/bin/bash
file1=/home/turgun/Desktop/IBteck/script-last/frags
imagelist=/home/turgun/Desktop/IBteck/script-last/imagelist
counter=1
for counter in `cat $imagelist`
do
n=`awk '/'$counter'/{ print NR; exit }' $file1`
for n in `cat $file1`
do
if [[ ! $n = 'IMAGE' ]]; then
echo "For Image $counter total size is " \
`grep FRAG frags | awk '{print total+=$4}'`
fi
done
done
awk 'function summary() {print "For Image",image,"total size is",sum}
$1=="IMAGE" {image=$4}
$1=="FRAG" {sum+=$4}
$1=="" {summary(); sum=0}
END{summary()}' file
Output:
For Image admindb1_1514997916 total size is 10784
For Image admindb1_1514995211 total size is 75551776
I assume that the last line is not empty.
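Note this relies on image blocks being separated by blank lines, since the $1=="" rule is what flushes each running sum; if your file has no blank separators, printing the summary when the next IMAGE line arrives (as in the /^IMAGE/ awk answer below) is more robust.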
cat -s file.txt | sed -n '/^IMAGE /{s:^[^ ]* *[^ ]* *[^ ]* *\([^ ]*\).*$:echo -n "For image \1 total size is "; echo `echo ":;s:\*::g;p};/^$/{s:^:0"`|bc:;p};/^FRAG /{s:^[^ ]* *[^ ]* *[^ ]* *\([^ ]*\).*$:\1\+:;s:\*::g;p};' | bash
Output:
For image admindb1_1514997916 total size is 10784
For image admindb1_1514995211 total size is 75551776
Awk solution:
awk '/^IMAGE/{
if (t) { printf "For image %s total size is %d\n",img,t; t=0 } img=$4
}
/^FRAG/{ t+=$4 }
END{ if (t) printf "For image %s total size is %d\n",img,t }' file
The output:
For image admindb1_1514997916 total size is 10784
For image admindb1_1514995211 total size is 75551776
Gnarly combo of cut, GNU sed (with a careless use of evaluate), and datamash:
cut -d' ' -f4 file | datamash --no-strict transpose |
sed 's#\tN/A\t#\n#g;y#\t# #' |
sed 's/ \(.*\)/ $((\1))/;y/ /+/;s/+/ /;
s/\(.*\) \(.*\)/echo For image \1 total size is \2/e'
Output:
For image admindb1_1514997916 total size is 10784
For image admindb1_1514995211 total size is 75551776
Cyrus's answer is just better. This answer shows some limits of sed. Also, if the data file is huge (say, millions of numbers to sum), the evaluate used to farm out the addition to the shell would probably exceed its command-line length limit.