Convert code from Matlab to Mathematica

I need to convert some code from Matlab to Mathematica.
At some point I have
fspecial('gaussian', 11, 1.5)
I am confused about what the equivalent in Mathematica would be.
In Matlab I get:
0.0000 0.0000 0.0000 0.0001 0.0002 0.0003 0.0002 0.0001 0.0000 0.0000 0.0000
0.0000 0.0001 0.0003 0.0008 0.0016 0.0020 0.0016 0.0008 0.0003 0.0001 0.0000
0.0000 0.0003 0.0013 0.0039 0.0077 0.0096 0.0077 0.0039 0.0013 0.0003 0.0000
0.0001 0.0008 0.0039 0.0120 0.0233 0.0291 0.0233 0.0120 0.0039 0.0008 0.0001
0.0002 0.0016 0.0077 0.0233 0.0454 0.0567 0.0454 0.0233 0.0077 0.0016 0.0002
0.0003 0.0020 0.0096 0.0291 0.0567 0.0708 0.0567 0.0291 0.0096 0.0020 0.0003
0.0002 0.0016 0.0077 0.0233 0.0454 0.0567 0.0454 0.0233 0.0077 0.0016 0.0002
0.0001 0.0008 0.0039 0.0120 0.0233 0.0291 0.0233 0.0120 0.0039 0.0008 0.0001
0.0000 0.0003 0.0013 0.0039 0.0077 0.0096 0.0077 0.0039 0.0013 0.0003 0.0000
0.0000 0.0001 0.0003 0.0008 0.0016 0.0020 0.0016 0.0008 0.0003 0.0001 0.0000
0.0000 0.0000 0.0000 0.0001 0.0002 0.0003 0.0002 0.0001 0.0000 0.0000 0.0000
I need to get the same in Mathematica too.
Thank you in advance

According to the Matlab documentation, this command creates a correlation kernel for a Gaussian filter. In Mathematica, you can simply use ImageCorrelate and pass this kernel as the second argument.

GaussianMatrix[{5, 1.5}, Method -> "Gaussian"]
5 is the radius ((11 - 1) / 2)
1.5 is the standard deviation
Setting the Method to "Gaussian" makes Mathematica sample the Gaussian directly, matching Matlab's formula (the default method, "Bessel", uses a different discretization)
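For instance, a minimal sketch (img is a placeholder for whatever image you are filtering; it is not part of the original question):
kernel = GaussianMatrix[{5, 1.5}, Method -> "Gaussian"];
MatrixForm[kernel] (* compare against the Matlab matrix above *)
Total[kernel, 2] (* sums to 1, matching fspecial's normalization *)
ImageCorrelate[img, kernel] (* apply the filter to the placeholder image img *)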

Related

Windows batch replace empty cells with above content

I have a text file with a listing as shown below. I want to fill in the missing numbers in the first columns as shown.
Typical original text:
5 401 6 5.80 0.15 -3.56 0.61 -0.02 0.96
8 -6.11 -0.64 4.07 0.24 0.20 0.38
402 6 -0.33 1.07 0.30 1.29 -0.00 2.04
8 0.02 -0.59 0.21 0.50 0.22 0.79
403 6 3.77 -0.70 -2.74 -0.94 0.20 -1.48
8 -4.08 0.22 2.23 -0.06 -0.19 -0.09
404 6 -2.36 0.22 1.12 -0.26 0.21 -0.41
8 2.05 0.27 -1.63 0.20 -0.16 0.32
16 401 16 -6.30 -0.76 -3.61 0.64 -0.22 -1.01
227 5.99 0.27 4.12 0.47 0.15 -0.74
402 16 -12.50 0.14 -7.52 -0.01 -0.24 0.02
227 12.19 0.35 8.03 0.24 0.13 -0.38
403 16 20.48 0.19 12.84 -0.29 0.03 0.46
227 -20.79 -0.68 -13.35 -0.64 -0.18 1.02
404 16 14.28 1.09 8.93 -0.94 0.01 1.48
227 -14.59 -0.60 -9.44 -0.87 -0.21 1.38
709 401 374 -1.17 -0.99 25.11 0.63 -1.12 -0.11
204 1.05 0.79 -24.91 -0.19 -0.62 0.06
402 374 -1.55 1.09 30.49 -0.90 -1.40 0.14
204 1.43 -0.90 -30.28 0.41 -0.79 -0.09
403 374 1.90 -1.58 0.79 1.65 0.50 -0.21
204 -2.02 1.38 -0.99 -0.93 0.41 0.14
404 374 1.51 0.50 6.16 0.12 0.22 0.04
204 -1.64 -0.31 -6.37 -0.32 0.24 -0.02
How I want it to be:
5 401 6 5.80 0.15 -3.56 0.61 -0.02 0.96
5 401 8 -6.11 -0.64 4.07 0.24 0.20 0.38
5 402 6 -0.33 1.07 0.30 1.29 -0.00 2.04
5 402 8 0.02 -0.59 0.21 0.50 0.22 0.79
5 403 6 3.77 -0.70 -2.74 -0.94 0.20 -1.48
5 403 8 -4.08 0.22 2.23 -0.06 -0.19 -0.09
5 404 6 -2.36 0.22 1.12 -0.26 0.21 -0.41
5 404 8 2.05 0.27 -1.63 0.20 -0.16 0.32
16 401 16 -6.30 -0.76 -3.61 0.64 -0.22 -1.01
16 401 227 5.99 0.27 4.12 0.47 0.15 -0.74
16 402 16 -12.50 0.14 -7.52 -0.01 -0.24 0.02
16 402 227 12.19 0.35 8.03 0.24 0.13 -0.38
16 403 16 20.48 0.19 12.84 -0.29 0.03 0.46
16 403 227 -20.79 -0.68 -13.35 -0.64 -0.18 1.02
16 404 16 14.28 1.09 8.93 -0.94 0.01 1.48
16 404 227 -14.59 -0.60 -9.44 -0.87 -0.21 1.38
709 401 374 -1.17 -0.99 25.11 0.63 -1.12 -0.11
709 401 204 1.05 0.79 -24.91 -0.19 -0.62 0.06
709 402 374 -1.55 1.09 30.49 -0.90 -1.40 0.14
709 402 204 1.43 -0.90 -30.28 0.41 -0.79 -0.09
709 403 374 1.90 -1.58 0.79 1.65 0.50 -0.21
709 403 204 -2.02 1.38 -0.99 -0.93 0.41 0.14
709 404 374 1.51 0.50 6.16 0.12 0.22 0.04
709 404 204 -1.64 -0.31 -6.37 -0.32 0.24 -0.02
I had a similar problem before, where two "cells" were regularly missing (e.g. the 402 to 404 numbers above were also missing). Back then I managed to use this script:
for /F "delims=" %%i in ('type "tmp1.txt"') do (
set row=%%i
set cnt=0
for %%l in (%%i) do set /A cnt+=1
if !cnt! equ 7 (
set row=!header! !row!
) else (
for /F "tokens=1,2" %%j in ("%%i") do set header=%%j %%k
)
echo.!row!
) >> "tmp2.txt"
Idea anyone?
Assuming the file is formatted with spaces (no TABs):
@echo off
setlocal enabledelayedexpansion
(for /f "delims=" %%a in (tmp1.txt) do (
set "line=%%a"
set "col1=!line:~0,3!"
set "col2=!line:~3,5!"
set "rest=!line:~8!"
if "!col1!" == " " (
set "col1=!old1!"
) else (
set "old1=!col1!"
)
if "!col2!" == " " (
set "col2=!old2!"
) else (
set "old2=!col2!"
)
echo !col1!!col2!!rest!
))>tmp2.txt
You will notice that I don't split the lines into tokens with for /f; instead I take each line as a whole and "split" it manually, which preserves the format (the length of each substring). Then I simply replace "empty" values with the saved value from the line before.
Edit in response to "I have made a mistake when pasting the original text. There are 4 (empty) spaces before all lines.":
Adapt the counting as follows (for the first "token" increase the length by 4; for the rest, add 4 to the start position and keep the lengths unchanged):
set "col1=!line:~0,7!"
set "col2=!line:~7,5!"
set "rest=!line:~12!"
and adapt if "!col1!" == "   " ( to if "!col1!" == "       " ( (from three to seven spaces).

Method for initial guess of standard deviation of 2d gaussian/gabor?

I'm working on curve-fitting software in Matlab. So far it's going pretty well, but I need a method of generating an initial guess for my curve-fitting software. I'm given a selection of points and need to find an initial guess for SDx and SDy, but I don't know how to do this. Is there anywhere I can learn a good approach to this? Thank you so much!
My data is a 32x32 matrix that looks something like the following:
-0.0027 -0.0034 -0.0034 0.0003 0.0018 0.0028 0.0058 0.0057 0.0008 -0.0053
-0.0023 -0.0008 -0.0007 0.0005 0.0015 0.0033 0.0062 0.0054 0.0029 -0.0029
-0.0018 0.0004 0.0014 0.0009 0.0006 0.0024 0.0047 0.0045 0.0041 0.0009
-0.0034 -0.0020 0.0022 0.0022 -0.0007 0.0003 0.0012 0.0024 0.0022 0.0015
-0.0053 -0.0042 -0.0004 0.0010 -0.0014 -0.0020 -0.0021 -0.0003 0.0002 -0.0014
-0.0070 -0.0034 -0.0008 0.0000 0.0004 0.0032 0.0011 0.0019 0.0026 0.0006
-0.0054 -0.0016 0.0005 0.0012 0.0000 0.0045 0.0033 0.0035 0.0039 0.0013
-0.0050 -0.0015 -0.0009 0.0001 0.0001 0.0013 -0.0022 -0.0010 0.0012 -0.0024
-0.0044 -0.0028 -0.0019 0.0016 0.0026 -0.0005 -0.0057 -0.0057 -0.0042 -0.0057
-0.0037 -0.0022 -0.0024 0.0003 0.0036 0.0002 -0.0045 -0.0055 -0.0039 -0.0032
-0.0045 -0.0012 -0.0016 -0.0016 0.0000 0.0003 -0.0018 -0.0014 0.0025 -0.0015
-0.0047 -0.0028 -0.0028 -0.0021 -0.0041 -0.0025 -0.0008 0.0011 0.0020 -0.0029
-0.0028 -0.0020 -0.0024 -0.0024 -0.0044 -0.0060 -0.0032 0.0009 0.0018 -0.0008
-0.0005 -0.0017 0.0007 0.0025 -0.0020 -0.0030 -0.0010 -0.0011 -0.0004 0.0014
-0.0011 -0.0006 -0.0001 0.0003 -0.0002 0.0012 0.0033 0.0010 -0.0025 -0.0001
-0.0032 -0.0008 0.0001 -0.0039 -0.0022 0.0003 0.0016 0.0016 -0.0009 -0.0008
-0.0060 -0.0019 -0.0005 -0.0033 -0.0039 -0.0032 -0.0018 -0.0004 -0.0012 -0.0004
-0.0077 -0.0049 -0.0039 -0.0039 -0.0049 -0.0044 -0.0039 -0.0047 -0.0034 -0.0031
-0.0054 -0.0026 -0.0030 -0.0046 -0.0071 -0.0048 -0.0028 -0.0051 -0.0046 -0.0042
-0.0049 0.0002 0.0009 -0.0017 -0.0041 -0.0031 -0.0018 -0.0024 -0.0029 -0.0015
-0.0032 -0.0007 0.0021 0.0012 -0.0006 -0.0013 -0.0008

How can I separate a file with columns 1-7 of data into ONE column with all the data in order?

For example, I have all of this data, but I want it organized such that the output file is purely numbers, with rows 1-7 corresponding to columns 1-7, then rows 8-14 corresponding to columns 1-7 of the second row, and so on.
Can I do this using awk?
Also, here is an example of the data:
Total 31.6459262.4011 31.6463 31.6463 0.0006 0.0006 0.0007
Total 0.0007 0.0007 0.0007 0.0007 0.0007 0.0008 0.0008
Total 0.0008 0.0008 0.0008 0.0008 0.0008 0.0008 0.0008
Total 0.0008 0.0008 0.0008 0.0008 0.0008 0.0008 0.0008
Total 0.0008 0.0007 0.0007 0.0007 0.0006 0.0006 0.0006
Total 0.0005 0.0005 0.0004 0.0003 0.0003 0.0002 0.0001
Total 0.0001 0.0000 -0.0001 -0.0002 -0.0002 -0.0003 -0.0004
Total -0.0005 -0.0006 -0.0007 -0.0008 -0.0009 -0.0010 -0.0011
Total -0.0011 -0.0012 -0.0013 -0.0014 -0.0015 -0.0015 -0.0016
Total -0.0016 -0.0017 -0.0018 -0.0018 -0.0018 -0.0019 -0.0019
Total -0.0019 -0.0019 -0.0020 -0.0020 -0.0020 -0.0020 -0.0020
Total -0.0019 -0.0019 -0.0019 -0.0019 -0.0018 -0.0018 -0.0018
Total -0.0017 -0.0017 -0.0017 -0.0016 -0.0016 -0.0015 -0.0015
Total -0.0014 -0.0014 -0.0013 -0.0012 -0.0012 -0.0011 -0.0011
Total -0.0010 -0.0010 -0.0009 -0.0009 -0.0008 -0.0008 -0.0007
Total 31.6459262.4010 31.6461 31.6462 0.0006 0.0006 0.0006
Total 0.0007 0.0007 0.0007 0.0007 0.0007 0.0007 0.0007
Total 0.0008 0.0008 0.0008 0.0008 0.0008 0.0008 0.0008
The output is lengthy to type, but it would consist of all these numbers arranged in one column without the four numbers that repeat every so often, 31.6459, 262.4010, 31.6461, and 31.6462. These four numbers are not always exactly the same, but they are certainly always greater than ~20. And they do repeat every 101 numbers.
Output:
0.0006
0.0006
0.0007
0.0007
0.0007
0.0007
0.0007
0.0007
0.0008
0.0008
0.0008
0.0008
0.0008
0.0008
0.0008
0.0008
0.0008
0.0008
0.0008
0.0008
0.0008
0.0008
0.0008
0.0008
0.0008
0.0007
0.0007
0.0007
0.0006
0.0006
0.0006
0.0005
0.0005
0.0004
0.0003
0.0003
0.0002
0.0001
0.0001
0.0000
-0.0001
-0.0002
-0.0002
-0.0003
-0.0004
-0.0005
-0.0006
-0.0007
-0.0008
-0.0009
-0.0010
-0.0011
-0.0011
-0.0012
-0.0013
-0.0014
-0.0015
-0.0015
-0.0016
-0.0016
-0.0017
-0.0018
-0.0018
-0.0018
-0.0019
-0.0019
-0.0019
-0.0019
-0.0020
-0.0020
-0.0020
-0.0020
-0.0020
-0.0019
-0.0019
-0.0019
-0.0019
-0.0018
-0.0018
-0.0018
-0.0017
-0.0017
-0.0017
-0.0016
-0.0016
-0.0015
-0.0015
-0.0014
-0.0014
-0.0013
-0.0012
-0.0012
-0.0011
-0.0011
-0.0010
-0.0010
-0.0009
-0.0009
-0.0008
-0.0008
-0.0007
0.0006
0.0006
0.0006
0.0007
0.0007
0.0007
0.0007
0.0007
0.0007
0.0007
0.0008
0.0008
0.0008
0.0008
0.0008
0.0008
0.0008
There are PLENTY of numbers that repeat frequently in your data, so we can't exclude the ones you mention based on repetition alone - do you want to exclude numbers with value >= 20?
If so, this may be what you want using GNU awk for FIELDWIDTHS:
$ awk 'BEGIN{FIELDWIDTHS="8 8 8 8 8 8 8 8"}
{for (i=2;i<=NF;i++) if ($i<20) {sub(/^ +/,"",$i); print $i} }' file
0.0006
0.0006
0.0007
0.0007
0.0007
0.0007
0.0007
0.0007
0.0008
0.0008
0.0008
0.0008
0.0008
0.0008
0.0008
0.0008
0.0008
0.0008
0.0008
0.0008
0.0008
0.0008
0.0008
0.0008
0.0008
0.0007
0.0007
0.0007
0.0006
0.0006
0.0006
0.0005
0.0005
0.0004
0.0003
0.0003
0.0002
0.0001
0.0001
0.0000
-0.0001
-0.0002
-0.0002
-0.0003
-0.0004
-0.0005
-0.0006
-0.0007
-0.0008
-0.0009
-0.0010
-0.0011
-0.0011
-0.0012
-0.0013
-0.0014
-0.0015
-0.0015
-0.0016
-0.0016
-0.0017
-0.0018
-0.0018
-0.0018
-0.0019
-0.0019
-0.0019
-0.0019
-0.0020
-0.0020
-0.0020
-0.0020
-0.0020
-0.0019
-0.0019
-0.0019
-0.0019
-0.0018
-0.0018
-0.0018
-0.0017
-0.0017
-0.0017
-0.0016
-0.0016
-0.0015
-0.0015
-0.0014
-0.0014
-0.0013
-0.0012
-0.0012
-0.0011
-0.0011
-0.0010
-0.0010
-0.0009
-0.0009
-0.0008
-0.0008
-0.0007
0.0006
0.0006
0.0006
0.0007
0.0007
0.0007
0.0007
0.0007
0.0007
0.0007
0.0008
0.0008
0.0008
0.0008
0.0008
0.0008
0.0008
I feel like you could have come up with a briefer example btw.
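For what it's worth, if the columns in the real file are separated by whitespace rather than packed into fixed-width fields (an assumption - the fused 31.6459262.4011 in your sample suggests fixed width), a portable sketch without GNU extensions could be:
$ awk '{ for (i=2; i<=NF; i++)   # field 1 is the "Total" label
           if ($i + 0 < 20)      # drop the large repeating values
             print $i }' file
A fused pair like 31.6459262.4011 would then be read as a single field whose numeric value is greater than 20, so it gets dropped either way.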

x,y points at 95% confidence interval using awk

I'm pretty new to Linux and want to use bash/awk to find the x points where y=.975 and y=.025 (95% confidence interval), which I can then use to give me the 'width' of my broad peak (the data roughly makes a bell-curve shape).
This is the set of data with x,y values like so (NOTE: I intend to make the dx increment much smaller, resulting in many more/finer points):
0 0
0.100893 0
0.201786 0
0.302679 0
0.403571 0
0.504464 0
0.605357 0
0.70625 0
0.807143 0
0.908036 0
1.00893 0
1.10982 0
1.21071 0
1.31161 0
1.4125 0.00173803
1.51339 0.0186217
1.61429 0.0739904
1.71518 0.211295
1.81607 0.725379
1.91696 2.34137
2.01786 4.69752
2.11875 6.58415
2.21964 6.06771
2.32054 8.57593
2.42143 11.7745
2.52232 12.4957
2.62321 13.0301
2.72411 11.1008
2.825 11.4504
2.92589 12.6537
3.02679 12.1584
3.12768 11.0262
3.22857 6.89166
3.32946 5.88521
3.43036 6.48794
3.53125 5.0121
3.63214 2.70189
3.73304 0.914824
3.83393 0.154436
3.93482 0.0286775
4.03571 0.00533823
4.13661 0.00024829
4.2375 0
4.33839 0
4.43929 0
4.54018 0
4.64107 0
4.74196 0
4.84286 0
4.94375 0
5.04464 0
5.14554 0
5.24643 0
5.34732 0
5.44821 0
5.54911 0
First I want to normalise the y data so the values add up to a total of 1 (essentially giving me the probability of the distribution sitting at each x point).
Then I want to determine the x-values that mark the start and end of the 95% confidence interval for the data set. The way I tackled this was to keep a running sum of the column-2 y-values and then compute runsum/sum; this way the values should fill up from 0 to 1 (see below). (NOTE: I used column -t to clean up the output a little.)
sum=$( awk 'BEGIN {sum=0} {sum+=$2} END {print sum}' mydata.txt )
awk '{runsum += $2} ; {if (runsum!=0) {print $0,$2/'$sum',runsum/'$sum'} else{print $0,"0","0"}}' mydata.txt | column -t
This gives:
0 0 0 0
0.100893 0 0 0
0.201786 0 0 0
0.302679 0 0 0
0.403571 0 0 0
0.504464 0 0 0
0.605357 0 0 0
0.70625 0 0 0
0.807143 0 0 0
0.908036 0 0 0
1.00893 0 0 0
1.10982 0 0 0
1.21071 0 0 0
1.31161 0.00136559 8.92134e-06 8.92134e-06
1.4125 0.0259463 0.000169506 0.000178427
1.51339 0.159775 0.0010438 0.00122223
1.61429 0.552197 0.00360748 0.00482971
1.71518 1.2808 0.00836741 0.0131971
1.81607 2.20568 0.0144096 0.0276067
1.91696 3.29257 0.0215102 0.049117
2.01786 4.27381 0.0279206 0.0770376
2.11875 7.10469 0.0464146 0.123452
2.21964 9.56549 0.062491 0.185943
2.32054 11.3959 0.0744489 0.260392
2.42143 8.16116 0.0533165 0.313709
2.52232 9.08145 0.0593287 0.373037
2.62321 9.3105 0.0608251 0.433863
2.72411 10.8084 0.0706108 0.504473
2.825 10.4597 0.0683328 0.572806
2.92589 9.81763 0.0641382 0.636944
3.02679 9.06295 0.0592079 0.696152
3.12768 8.84222 0.0577659 0.753918
3.22857 10.285 0.0671915 0.82111
3.32946 8.37618 0.0547212 0.875831
3.43036 7.02052 0.0458648 0.921696
3.53125 4.82589 0.0315273 0.953223
3.63214 3.39214 0.0221607 0.975384
3.73304 2.2402 0.0146351 0.990019
3.83393 1.06194 0.00693761 0.996956
3.93482 0.350213 0.00228793 0.999244
4.03571 0.091619 0.000598543 0.999843
4.13661 0.0217254 0.000141931 0.999985
4.2375 0.00211046 1.37875e-05 0.999999
4.33839 0 0 0.999999
4.43929 0 0 0.999999
4.54018 0 0 0.999999
4.64107 0 0 0.999999
4.74196 0 0 0.999999
4.84286 0 0 0.999999
4.94375 0 0 0.999999
5.04464 0 0 0.999999
5.14554 0 0 0.999999
5.24643 0 0 0.999999
5.34732 0 0 0.999999
5.44821 0 0 0.999999
5.54911 0 0 0.999999
I guess I could use this to find the x points where y=.975 and y=.025 and solve my problem, but do you guys know of a more elegant way - and is this doing what I think it is?
The 95% confidence interval is displayed at the bottom of the output:
$ awk -v "sum=$sum" -v lower=N -v upper=N '{runsum += $2; cdf=runsum/sum; printf "%10.4f %10.4f %10.4f %10.4f",$1,$2,$2/sum,cdf; print ""} lower=="N" && cdf>0.025{lower=$1} upper=="N" && cdf>0.975 {upper=$1} END{printf "lower=%s upper=%s\n",lower,upper}' mydata.txt
0.0000 0.0000 0.0000 0.0000
0.1009 0.0000 0.0000 0.0000
0.2018 0.0000 0.0000 0.0000
0.3027 0.0000 0.0000 0.0000
0.4036 0.0000 0.0000 0.0000
0.5045 0.0000 0.0000 0.0000
0.6054 0.0000 0.0000 0.0000
0.7063 0.0000 0.0000 0.0000
0.8071 0.0000 0.0000 0.0000
0.9080 0.0000 0.0000 0.0000
1.0089 0.0000 0.0000 0.0000
1.1098 0.0000 0.0000 0.0000
1.2107 0.0000 0.0000 0.0000
1.3116 0.0000 0.0000 0.0000
1.4125 0.0017 0.0000 0.0000
1.5134 0.0186 0.0001 0.0001
1.6143 0.0740 0.0005 0.0006
1.7152 0.2113 0.0014 0.0020
1.8161 0.7254 0.0047 0.0067
1.9170 2.3414 0.0153 0.0220
2.0179 4.6975 0.0307 0.0527
2.1187 6.5842 0.0430 0.0957
2.2196 6.0677 0.0396 0.1354
2.3205 8.5759 0.0560 0.1914
2.4214 11.7745 0.0769 0.2683
2.5223 12.4957 0.0816 0.3500
2.6232 13.0301 0.0851 0.4351
2.7241 11.1008 0.0725 0.5076
2.8250 11.4504 0.0748 0.5824
2.9259 12.6537 0.0827 0.6651
3.0268 12.1584 0.0794 0.7445
3.1277 11.0262 0.0720 0.8165
3.2286 6.8917 0.0450 0.8616
3.3295 5.8852 0.0384 0.9000
3.4304 6.4879 0.0424 0.9424
3.5312 5.0121 0.0327 0.9751
3.6321 2.7019 0.0177 0.9928
3.7330 0.9148 0.0060 0.9988
3.8339 0.1544 0.0010 0.9998
3.9348 0.0287 0.0002 1.0000
4.0357 0.0053 0.0000 1.0000
4.1366 0.0002 0.0000 1.0000
4.2375 0.0000 0.0000 1.0000
4.3384 0.0000 0.0000 1.0000
4.4393 0.0000 0.0000 1.0000
4.5402 0.0000 0.0000 1.0000
4.6411 0.0000 0.0000 1.0000
4.7420 0.0000 0.0000 1.0000
4.8429 0.0000 0.0000 1.0000
4.9437 0.0000 0.0000 1.0000
5.0446 0.0000 0.0000 1.0000
5.1455 0.0000 0.0000 1.0000
5.2464 0.0000 0.0000 1.0000
5.3473 0.0000 0.0000 1.0000
5.4482 0.0000 0.0000 1.0000
5.5491 0.0000 0.0000 1.0000
lower=2.01786 upper=3.53125
To be more accurate, one would want to interpolate between adjacent values to get the 2.5% and 97.5% limits. You mentioned, however, that your actual dataset has many more data points. In that case, interpolation is a superfluous complication.
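Should you ever need it, here is a minimal interpolation sketch (reusing the sum shell variable computed in the question) that linearly interpolates between the last point below each limit and the first point above it:
$ awk -v "sum=$sum" '{runsum += $2; cdf = runsum/sum
  if (prevcdf < 0.025 && cdf >= 0.025) lower = prevx + (0.025 - prevcdf)/(cdf - prevcdf)*($1 - prevx)
  if (prevcdf < 0.975 && cdf >= 0.975) upper = prevx + (0.975 - prevcdf)/(cdf - prevcdf)*($1 - prevx)
  prevx = $1; prevcdf = cdf}
  END {printf "lower=%g upper=%g\n", lower, upper}' mydata.txt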
How it works:
-v "sum=$sum" -v lower=N -v upper=N
Here we define three variables to be used by awk. Note that we define sum here as an awk variable. That allows us to use sum in the awk formulas without the complication of mixing shell variable expansion in with awk code.
runsum += $2; cdf=runsum/sum;
Just as you had it, we compute the running sum, runsum, and the cumulative probability distribution, cdf.
printf "%10.4f %10.4f %10.4f %10.4f",$1,$2,$2/sum,cdf; print ""
Here we print out each line. I took the liberty of changing the format to something that prints nicely. If you need tab-separated values, then change this back.
lower=="N" && cdf>0.025{lower=$1}
If we have not previously reached the lower confidence limit, then lower is still equal to N. If that is the case and the current cdf is now greater than 0.025, we set lower to the current value of x.
upper=="N" && cdf>0.975 {upper=$1}
This does the same for the upper confidence limit.
END{printf "lower=%s upper=%s\n",lower,upper}
At the end, this prints the lower and upper confidence limits.
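As an aside, the separate shell step that computes sum can be folded into awk itself by letting it read the file twice - a sketch:
$ awk 'NR==FNR {sum += $2; next}
  {runsum += $2; cdf = runsum/sum}
  lower=="" && cdf > 0.025 {lower = $1}
  upper=="" && cdf > 0.975 {upper = $1}
  END {printf "lower=%s upper=%s\n", lower, upper}' mydata.txt mydata.txt
The first pass (NR==FNR) only accumulates the total; the second pass does the same cumulative work as above.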

Comparing 2 files using AWK with multiple parameters

I have a problem while comparing 2 text files using awk. Here is what I want to do.
File1 contains a name in the first column which has to match the name in the first column of file2. That's easy - so far so good. Then, if this matches, I need to check whether the number in the 2nd column of file1 lies within the numeric range of columns 2 and 3 in file2 (see example). If that's the case, print both matching lines as one line to a new file. I wrote something in awk and it gives me an output with correct assignments, but it misses the majority. Am I missing some kind of loop function? The files are both sorted according to the first column.
File1:
scaffold10| 300 T C 0.9695 0.0000
scaffold10| 456 T A 1.0000 0.0000
scaffold10| 470 C A 0.9906 0.0000
scaffold10| 600 T C 0.8423 0.0000
scaffold56| 5 A C 0.8423 0.0000
scaffold56| 1000 C T 0.8423 0.0000
scaffold56| 6000 C C 0.7518 0.0000
scaffold7| 2 T T 0.9046 0.0000
scaffold9| 300 T T 0.9034 0.0000
scaffold9| 10900 T G 0.9044 0.0000
File2:
scaffold10| 400 550
scaffold10| 700 800
scaffold56| 3 5000
scaffold7| 55 200
scaffold7| 214 567
scaffold7| 656 800
scaffold9| 234 675
scaffold9| 699 1254
scaffold9| 10887 11000
Output:
scaffold10| 456 T A 1.0000 0.0000 scaffold10| 400 550
scaffold10| 470 C A 0.9906 0.0000 scaffold10| 400 550
scaffold56| 5 A C 0.8423 0.0000 scaffold56| 3 5000
scaffold56| 1000 C T 0.8423 0.0000 scaffold56| 3 5000
scaffold9| 300 T T 0.9034 0.0000 scaffold9| 234 675
scaffold9| 10900 T G 0.9044 0.0000 scaffold9| 10887 11000
My awk try:
awk -F "\t" ' FNR==NR {b[$1]=$0; c[$1]=$1; d[$1]=$2; e[$1]=$3; next} for {if (c[$1]==$1 && d[$1]<=$2 && e[$1]>=$2) {print b[$1]"\t"$0}}' File1 File2 > out.txt
How can I get the output I want using awk? Any suggestions are very welcome...
Use join to do a database-style join of the two files, then use awk to filter out the incorrect matches:
$ join file1 file2 | awk '$2 >= $7 && $2 <= $8'
scaffold10| 456 T A 1.0000 0.0000 400 550
scaffold10| 470 C A 0.9906 0.0000 400 550
scaffold56| 5 A C 0.8423 0.0000 3 5000
scaffold56| 1000 C T 0.8423 0.0000 3 5000
scaffold9| 300 T T 0.9034 0.0000 234 675
scaffold9| 10900 T G 0.9044 0.0000 10887 11000
Or if you want the output formatted the same the way it is in the example you gave:
$ join file1 file2 | awk '$2 >= $7 && $2 <= $8 { printf("%-12s %-5s %-3s %-3s %-8s %-8s %-12s %-5s %-5s\n", $1, $2, $3, $4, $5, $6, $1, $7, $8); }'
scaffold10| 456 T A 1.0000 0.0000 scaffold10| 400 550
scaffold10| 470 C A 0.9906 0.0000 scaffold10| 400 550
scaffold56| 5 A C 0.8423 0.0000 scaffold56| 3 5000
scaffold56| 1000 C T 0.8423 0.0000 scaffold56| 3 5000
scaffold9| 300 T T 0.9034 0.0000 scaffold9| 234 675
scaffold9| 10900 T G 0.9044 0.0000 scaffold9| 10887 11000
An awk solution that reads the ranges (file2) into an array and then compares each line of file1 against them on the fly:
awk 'NR==FNR{i++; x[i]=$0; x_0[i]=$1; x_1[i]=$2; x_2[i]=$3; next}
{ for(j=1;j<=i;j++){
if( $1==x_0[j] && x_1[j]<=$2 && x_2[j]>=$2 ){ # exact name match, inclusive bounds
print $0,x[j]
}
}
}' file2 file1
# scaffold10| 456 T A 1.0000 0.0000 scaffold10| 400 550
# scaffold10| 470 C A 0.9906 0.0000 scaffold10| 400 550
# scaffold56| 5 A C 0.8423 0.0000 scaffold56| 3 5000
# scaffold56| 1000 C T 0.8423 0.0000 scaffold56| 3 5000
# scaffold9| 300 T T 0.9034 0.0000 scaffold9| 234 675
# scaffold9| 10900 T G 0.9044 0.0000 scaffold9| 10887 11000
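Since that inner loop checks every stored range for every line of file1, a variant that indexes the ranges by the name in column 1 (so each line only scans its own scaffold's ranges) could look like this; lo, hi, and n are names introduced here for illustration:
awk 'NR==FNR {n[$1]++; lo[$1,n[$1]] = $2; hi[$1,n[$1]] = $3; next}
  {for (j=1; j<=n[$1]; j++)
     if ($2 >= lo[$1,j] && $2 <= hi[$1,j])
       print $0, $1, lo[$1,j], hi[$1,j]}' file2 file1
The output is the same for the sample data.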
