I have several columns in a file, and I want to subtract two of them.
The columns look like this, and I want the result without the decimal points:
1.000 900
1.012 1.010
1.015 1.005
1.020 1.010
I need another column in the same file with the result of the subtraction:
100
2
10
10
I have tried
awk -F "," '{$16=$4-$2; print $1","$2","$3","$4","$5","$6}'
but it gives me...
0.100
0.002
0.010
0.010
Any suggestions?
Using this awk:
awk -v OFS='\t' '{p=$1;q=$2;sub(/\./, "", p); sub(/\./, "", q); print $0, (p-q)}' file
1.000 900 100
1.012 1.010 2
1.015 1.005 10
1.020 1.010 10
Using perl:
perl -lanE '$,="\t",($x,$y)=map{s/\.//r}@F;say @F,$x-$y' file
prints:
1.000 900 100
1.012 1.010 2
1.015 1.005 10
1.020 1.010 10
I would like to compare two files to identify common translocations. However, these translocations don't have exactly the same coordinates in the two files, so I want to check whether a translocation occurs between the same pair of chromosomes (chr1, chr2) and whether the coordinates overlap.
Here is an example with two files:
file_1.txt:
chr1 min1 max1 chr2 min2 max2
1 111111 222222 2 333333 444444
2 777777 888888 3 555555 666666
15 10 100 15 2000 2100
17 500 530 18 700 750
20 123456 234567 20 345678 456789
file_2.txt:
chr1 min1 max1 chr2 min2 max2
1 100000 200000 2 400000 500000
2 800000 900000 3 500000 600000
15 200 300 15 2000 3000
20 150000 200000 20 300000 500000
The objective is for the pair chr1/chr2 to match between file 1 and file 2; then the coordinates min1 and max1 must overlap between the two files, and likewise min2 and max2.
For the result, perhaps the best solution is to print the two matching lines together, as follows:
1 111111 222222 2 333333 444444
1 100000 200000 2 400000 500000
2 777777 888888 3 555555 666666
2 800000 900000 3 500000 600000
20 123456 234567 20 345678 456789
20 150000 200000 20 300000 500000
(For this simplified example, I tried to represent the different types of overlap I could encounter. I hope it is clear enough).
Thank you for your help.
awk to the rescue!
$ awk 'function overlap(x1,y1,x2,y2) {return y1>x2 && y2>x1}
{k=$1 FS $4}
NR==FNR {r[k]=$0; c1min[k]=$2; c1max[k]=$3; c2min[k]=$5; c2max[k]=$6; next}
overlap(c1min[k],c1max[k],$2,$3) &&
overlap(c2min[k],c2max[k],$5,$6) {print r[k] ORS $0 ORS}' file1 file2
1 111111 222222 2 333333 444444
1 100000 200000 2 400000 500000
2 777777 888888 3 555555 666666
2 800000 900000 3 500000 600000
20 123456 234567 20 345678 456789
20 150000 200000 20 300000 500000
This assumes the first file can be held in memory; the trailing `ORS` in the `print` adds an empty line after each printed pair.
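Note also that `r[k]` is overwritten if the same chromosome pair occurs on more than one line of file_1. If that can happen in your data, a sketch of a variant (same field layout and overlap test; the `+0` forces numeric comparison, since the stored fields are strings) that keeps every file_1 row per key:

```shell
awk 'function overlap(x1,y1,x2,y2) { return y1+0 > x2+0 && y2+0 > x1+0 }
     { k = $1 FS $4 }
     NR == FNR { n[k]++; r[k, n[k]] = $0; next }   # store every file_1 row per key
     {
       for (i = 1; i <= n[k]; i++) {
         split(r[k, i], f)                          # re-split the stored row into fields
         if (overlap(f[2], f[3], $2, $3) && overlap(f[5], f[6], $5, $6))
           print r[k, i] ORS $0 ORS
       }
     }' file1 file2
```

For the sample files above (where each pair is unique) this prints the same pairs as the one-liner.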
I have a large file with this format:
#1995GO
CCD3 0.099 -0.008 0.047 0.019 2 2 4
CCD7 0.090 -0.040 0.000 0.000 1 1 4
#
#1995SM55
CCD3 0.174 0.026 0.026 0.047 4 4 10
CCD7 0.157 0.006 0.015 0.011 5 5 10
#
#1999TC36
CCD3 0.080 0.019 0.008 0.001 2 2 4
CCD7 0.085 0.032 0.004 0.014 2 2 4
#
I want to get the mean of column 4 for each block between a `#name` line and the closing `#`. For example, for the first block I want to print ((0.047 + 0.000) / 2).
Short awk approach:
awk '/^#[0-9]/{ f=1;next }/^#[[:space:]]*$/{ print s/c; f=c=s=0 }f{ ++c; s+=$4 }' file
The output:
0.0235
0.0205
0.006
Here is one approach using awk:
awk '
!/^#/ {
T += $4
C += 1
}
/^#$/ {
printf ( "%.3f\n", ( T /C ) )
T = C = 0
}
' file
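Both approaches rely on each block ending with a bare # line; if the file might end without one, an END guard (a small, assumed extension of the second script) flushes the last block:

```shell
awk '
!/^#/ { T += $4; C += 1 }
/^#$/ { printf("%.3f\n", T / C); T = C = 0 }
END   { if (C) printf("%.3f\n", T / C) }   # flush an unterminated final block
' file
```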
I am having difficulties writing an AWK statement that prints a group of lines twice under specified conditions, with the option to change values in the repeated lines. For example, if the first field of a row is 11 ($1 == 11), then I would like to print that row and the row that follows twice, adjusting the value in the second column ($2).
So far this is what I have, but it does not replicate lines whose first field is 11, or the line that follows them.
awk '{if(NF<3) print $0; if(NF==3 && $1==11) print $0, 1, 20; if(NF==3 && $1 != 11) print $0, 0, 0; if(NF>3) print $0;}'
Example Input
1 3
6 0.1 99
0.100 0.110 0.111
7 0.4 88
0.200 0.220 0.222
11 0.5 77
0.300 0.330 0.333
2 2
7 0.3 66
0.400 0.440 0.444
11 0.7 55
0.500 0.550 0.555
This is a simplified version of what I would like to do, so for simplicity let's say I want the row where $1 == 11 and the following row to be printed with the value in the second column ($2) halved. For example, in the group of rows under the 1 3 section, the value after 11 is 0.5; ideally, the printed rows would have 0.25 after the 11.
Ideal Output
1 3
6 0.1 99 0 0
0.100 0.110 0.111
7 0.4 88 0 0
0.200 0.220 0.222
11 0.25 77 1 20
0.300 0.330 0.333
11 0.25 77 1 20
0.300 0.330 0.333
2 2
7 0.3 66 0 0
0.400 0.440 0.444
11 0.35 55 1 20
0.500 0.550 0.555
11 0.35 55 1 20
0.500 0.550 0.555
With GNU awk for gensub() and \s/\S:
$ awk '$1==11{$0=gensub(/^(\s+\S+\s+)\S+/,"\\1"$2/2,1); c=2; s=$0} {print} c&&!--c{print s ORS $0}' file
1 3
6 0.1 99
0.100 0.110 0.111
7 0.4 88
0.200 0.220 0.222
11 0.25 77
0.300 0.330 0.333
11 0.25 77
0.300 0.330 0.333
2 2
7 0.3 66
0.400 0.440 0.444
11 0.35 55
0.500 0.550 0.555
11 0.35 55
0.500 0.550 0.555
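For awks without gensub() and \s/\S, a POSIX sketch of the same duplicate-and-halve idea (note it rebuilds the record with single-space separators, so the original spacing is not preserved):

```shell
awk '$1 == 11 { $2 = $2 / 2; c = 2; s = $0 }  # halve field 2 and remember the modified line
     { print }                                # every line prints once
     c && !--c { print s ORS $0 }             # after the next line, print both again
' file
```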
You can use the following awk script. (P.S. There are leading and trailing spaces in your input file; that's why I had to use NF>2 && NF<=5 rather than NF==3, and why the 11 lands in $2 rather than $1.)
BEGIN {
c=0;FS="[ \t]+";OFS=" ";x="";y="";
}
c==2{
print x, 1, 20;
print y;
c=0;
}
NF ==2{
print $0;
}
NF>2 && NF<=5{
if(c==1){
print $0;
y=$0;c=2;next;
}
if($2==11){
$3=$3/2; # halve the value that follows the 11 ($1 is empty because of the leading spaces)
print $0, 1, 20;
x=$0;c=1;
}
else print $0;
}
NF>5{
print $0;
}
END{
if(c==2){
print x, 1, 20;
print y;
}
}
I have two files, file 1:
1 800 800 0.51
2 801 801 0.01
3 802 802 0.01
4 803 803 0.23
and file 2:
1 800 800 0.55
2 801 801 0.09
3 802 802 0.88
4 804 804 0.24
I have an awk script that looks in the second file for values that match the first three columns of the first file.
$ awk 'NR==FNR{a[$1,$2,$3];next} {if (($1,$2,$3) in a) {print $4} else {print "not found"}}' f1 f2
0.55
0.09
0.88
not found
Is there a way to make it such that any rows occurring in file 2 that are not in file 1 are still added at the end of the output, after the matches, such as this:
0.55
0.09
0.88
not found
4 804 804 0.24
That way, when I paste the two files back together, they will look something like this:
1 800 800 0.51 0.55
2 801 801 0.01 0.09
3 802 802 0.01 0.88
4 803 803 0.23 not found
4 804 804 not found 0.24
Or is there any other more elegant solution with completely different syntax?
awk '{k=$1FS$2FS$3}NR==FNR{a[k]=$4;next}
k in a{print $4;next}{print "not found";print}' f1 f2
The above one-liner will give you:
0.55
0.09
0.88
not found
4 804 804 0.24
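Here the unmatched row happens to be the last line of f2, so it already ends up at the end. If unmatched file-2 rows could appear anywhere and must still be printed after all the matches, a sketch that defers them to END:

```shell
awk '{ k = $1 FS $2 FS $3 }
     NR == FNR { a[k] = $4; next }
     k in a { print $4; next }
     { print "not found"; tail = tail $0 ORS }  # collect unmatched rows for later
     END { printf "%s", tail }' f1 f2
```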
I have seen this question a few times, but I cannot get the solutions to work.
I have the following command:
printf '%s\n' "${fa[@]}" | xargs -n 3 bash -c 'cat *-$2.ss | sed -n 11,1p ; echo $0 $1 $2;'
where
printf '%s\n' "${fa[@]}"
O00238 115 03
O00238 126 04
and cat *-$2.ss gives:
1 D C 0.999 0.000 0.000
2 L C 0.940 0.034 0.012
3 H C 0.971 0.005 0.015
4 P C 0.977 0.005 0.009
5 T C 0.970 0.009 0.018
6 L C 0.977 0.006 0.011
7 P C 0.864 0.027 0.014
8 P C 0.966 0.018 0.011
9 L C 0.920 0.038 0.039
10 K C 0.924 0.043 0.039
11 D C 0.935 0.036 0.035
12 R C 0.934 0.023 0.053
13 D C 0.932 0.022 0.046
14 F C 0.878 0.041 0.088
15 V C 0.805 0.031 0.198
16 D C 0.834 0.039 0.108
17 G C 0.882 0.019 0.071
18 P C 0.800 0.031 0.132
19 I C 0.893 0.039 0.070
20 H C 0.823 0.024 0.179
21 H C 0.920 0.026 0.070
22 R C 0.996 0.001 0.002
running the command then produces
11 D C 0.935 0.036 0.035
O00238 115 03
11 K C 0.449 0.252 0.270
O00238 126 04
Odd lines are the output of sed -n 11,1p, even lines the output of echo $0 $1 $2.
How do I pair the output on the same line i.e.
11 D C 0.935 0.036 0.035 O00238 115 03
11 K C 0.449 0.252 0.270 O00238 126 04
I have tried:
printf '%s\n' "${fa[@]}" | xargs -n 3 bash -c 'cat *-$2.ss | {sed -n 11,1p ; echo $0 $1 $2;} | tr "\n" " "'
as suggested here: Concatenate in bash the output of two commands without newline character
however I get
O00238: -c: line 0: syntax error near unexpected token `}'
O00238: -c: line 0: `cat *-$2.ss | {sed -n 11,1p ; echo $0 $1 $2;} | tr "\n" " "'
What is the problem?
You could try using something like this:
i=0
for f in *-"$2".ss; do printf '%s %s\n' "$(sed -n '11p' "$f")" "${fa[$((i++))]}"; done
This loops through your files and prints the 11th line of each alongside the element of the array fa at index i, which increases by 1 each iteration.
I could not reproduce your setup, but
printf "O00238 115 03\nO00238 126 04" | xargs -n 3 bash -c 'cat test.dat | sed -n 11,1p | tr -d "\n"; echo " $0 $1 $2"'
gives
11 D C 0.935 0.036 0.035 O00238 115 03
11 D C 0.935 0.036 0.035 O00238 126 04
which should work in your case. I just deleted the newline from the sed output.
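As for the syntax error in the brace-group attempt itself: bash requires whitespace after { and a ; (or newline) before }, and the group should not be fed from cat when sed is meant to read the files directly. A corrected sketch of that attempt (using paste instead of tr so there is no trailing space; file names and the fa array as in the question):

```shell
printf '%s\n' "${fa[@]}" |
  xargs -n 3 bash -c '{ sed -n 11p *-"$2".ss; echo "$0 $1 $2"; } | paste -s -d " " -'
```

paste -s joins the brace group's two output lines with a single space and ends the result with a newline.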