I have the following dataset with multiple different ids in column 1 and I wish to calculate the mean and standard deviation for column 2 for each id
123456 0.1234
123456 0.5673
123456 0.0011
123456 -0.0947
123457 0.9938
123457 0.0001
123457 0.2839
I have the following code to get the mean per id but struggling to amend this to get the SD as well
awk '{sum4[$1] += $2; count4[$1]++}; END{ for (id in sum4) { print id, sum4[id]/count4[id] } }' < want3.txt > mean_id.txt
The desired output is a file of id mean and sd
123456 0.149275 0.2926
123457 0.425933 0.5118
Any advice would be much appreciated.
Thanks
here is another approach which is more memory efficient but possibly less precision for large mean.
$ awk -v t=1 '{s[$1]+=$2; ss[$1]+=$2*$2; c[$1]++}
END {for(k in s) print k,m=s[k]/c[k],sqrt((ss[k]-m^2*c[k])/(c[k]-t))}' file
123456 0.149275 0.292628
123457 0.425933 0.51185
this computes the sample standard deviation, if you have the full distribution not just the samples you can set t=0 to get population standard deviation which will be slightly lower but for large N they are practically equivalent (within the error of margin due to measurement errors).
With GNU awk. Derived from Ivan's answer with standard deviation of the population (division by n). I switched to sample standard deviation (division by n-1).
awk '
{
numrec[$1] += 1
sum[$1] += $2
array[numrec[$1]] = $2
array[$1,numrec[$1]] = $2
}
END {
for(w in numrec) {
for(x=1; x<=numrec[w]; x++)
sumsq[w] += ((array[w,x]-(sum[w]/numrec[w]))^2)
printf("%d %.6f %.4f\n", w, sum[w]/numrec[w], sqrt(sumsq[w]/(numrec[w]-1)))
}
}
' file
Output:
123456 0.149275 0.2926
123457 0.425933 0.5118
Image one file with 250 datasets with varying length (2000 +-500) lines and 11 columns. Here a comprehensive small example:
file.sum:
0.00000e+00 9.51287e-09
1.15418e-04 8.51287e-09
4.16445e-04 7.51287e-09
8.53721e-04 6.51287e-09
1.42697e-03 5.51287e-09
1.70302e-03 4.51287e-09
2.27189e-03 3.51287e-09
2.54732e-03 1.51287e-09
3.11304e-03 0.51287e-09
0.00000e+00 13.28378e-09
1.15418e-04 12.28378e-09
3.19663e-04 11.28378e-09
5.78178e-04 10.28378e-09
8.67479e-04 09.28378e-09
1.20883e-03 08.28378e-09
1.58817e-03 07.28378e-09
1.75840e-03 06.28378e-09
2.21069e-03 05.28378e-09
I wanted to display every 10 datasets and normalize it to the first element. The first value to normalize is 9.51287e-09 and the second would be 13.28378e-09. Of course with this massive dataset, I can not do it manually or even split the file.
So far I got every ten'th dataset but with the normalization, I do have my problems.
#!/usr/bin/gnuplot
reset
set xrange [0:0.1]
plot for [val=1:250:10] 'file.sum' i val u 1:11 w l
Working of this example:
plot.gp:
#!/usr/bin/gnuplot
reset
set xrange [0:0.01]
plot for [val=1:2:1] 'file.sum' i val u 1:2 w l
Some hints I found in:
Gnuplot: data normalization
I guess you can write a awk script to handle this, but there may be a more gnuplot friendlier way. Any suggestions are appreciated.
Assuming you have one file with data sections each separated by two or more empty lines you can use the script below.
In gnuplot console check help pseudocolumns. column(-2) tells you in which block you are and column(0) tells you wich line of this block you are (counting starts from 0).
Define a function Normalized(n) which does the following: if you are in the first line of a subblock put the value of column(n) into the variable y0. All values of this block will now be divided by y0. Also check help ternary.
In case you want a legend for the blocks you can plot a dummy plot, actually plotting NaN (i.e. nothing) but place an entry for the key.
Code:
### normalize each block by its first value
reset session
set colorsequence classic
$Data <<EOD
0.00000e+00 9.51287e-09
1.15418e-04 8.51287e-09
4.16445e-04 7.51287e-09
8.53721e-04 6.51287e-09
1.42697e-03 5.51287e-09
1.70302e-03 4.51287e-09
2.27189e-03 3.51287e-09
2.54732e-03 1.51287e-09
3.11304e-03 0.51287e-09
0.00000e+00 13.28378e-09
1.15418e-04 12.28378e-09
3.19663e-04 11.28378e-09
5.78178e-04 10.28378e-09
8.67479e-04 09.28378e-09
1.20883e-03 08.28378e-09
1.58817e-03 07.28378e-09
1.75840e-03 06.28378e-09
2.21069e-03 05.28378e-09
EOD
Normalized(n) = column(n)/(column(0)==0 ? y0=column(n) : y0)
plot $Data u 1:(Normalized(2)):(myBlocks=column(-2)+1) w lp pt 7 lc var notitle, \
for [i=0:myBlocks-1] '' u 1:(NaN) w lp pt 7 lc i+1 ti sprintf("Block %d",i)
### end of code
Result:
I would like to use fortran to read ultraviolet radiation data that has been produced by the Japan Aerospace Exploration Agency. This data is at a daily and monthly temporal resolution from 2000-2010 at a ~5 km spatial resolution. This question is worth answering as the data could be useful for a number of environment/health projects and is freely available, with proper acknowledgement of source and sharing of preprint of any subsequent publications, from:
ftp://suzaku.eorc.jaxa.jp/pub/GLI/glical/Global_05km/monthly/uvb/
There is a readme file available, which provides instructions on how to read data using fortran as follows:
Instructions for _le files
Header
Read header (size= pixel size *2byte):
character head*14400
read(10,rec=1) head
read(head,'(2i6,2f8.2,f8.4,2e12.5,a1,a8,a1,a40)')
& npixel,nline,lon_min,lat_max,reso,slope,offset,',',
& para,',',outfile
Read data (e.g., fortran77)
parameter(nl=7200, ml=3601)
... open file by "unformatted", "recl=nl*2(byte)" (,"bytereclen")
integer*2 i2buf(nl,ml)
do m=1,ml
read(10,rec=1+m) (i2buf(n,m), n=1,nl)
do n=1,nl
par=i2buf(n,m)*slope+offset
write(6,*) 'PAR[Ein/m^2/day]=',par
enddo
enddo
slope values
par__le : daily PAR [Ein/m^2/day] = DN * 0.01
dpar_le : direct PAR = DN * 0.01
swr__le : daily mean shortwave radiation [W/m^2] = DN * 0.01
tip__le : transmittance of instantaneous PAR at noon = DN * 0.0001
uva__le : daily mean UVA [W/m^2] = DN * 0.001
uvb__le : daily mean UVB [W/m^2] = DN * 0.0001
rpar_le : PAR-range surface reflectance (TOP of canopy/solid surfaces) = DN * 0.0001 (monthly data only)
error values
-1 as signed short integer (int16)
65535 as unsigned short integer (uint16)
Progress so far
I have downloaded and installed gfortran successfully on mac OSX. I have downloaded a test file (MOD02SSH_A20000224Av6_v601_7200_3601_uvb__le.gz) and decompressed it. I have created a program file:
PROGRAM readuvr
IMPLICIT NONE
!some code
END PROGRAM
I will then type the following into the command line to create an executable and run it to extract the data.
gfortran -o executable
./executable
As a complete beginner to fortran, my question is: how can I use the instructions provided to build a program that can read the data and output it into a text file?
Well, that file expands to 51,868,800 bytes. The comments imply the header is 14,400 bytes, which leaves 51,854,400 bytes of actual data payload.
There seem to be 7200 lines of data, so that means there are 7202 bytes per line. There seem to be 2 bytes (16-bit samples) so if we assume 2 bytes/sample, that means there are 3601 samples per line, which matches the ml=3601.
So basically, you need to read 14,400 bytes of header, then 7200 lines of data, each line consisting of 3601 values, each of those being 2 bytes wide...
Actually, if you are that unfamiliar with FORTRAN, you may like to extract the data with Perl which is already installed and available on OS X anyway. I have started a VERY SIMPLISTIC Perl program that reads the dat and prints the first 2 values on each line:
#!/usr/bin/perl
use strict;
use warnings;
# Read 14,400 bytes of header
my $buffer;
my $nBytes = 14400;
my $bytesRead = read (STDIN, $buffer, $nBytes) ;
my ($npixel,$nline,$lon_min,$lat_max,$reso,$slope,$offset,$junk)=split(' ',$buffer);
print "npixel:$npixel\n";
print "nline:$nline\n";
print "lon_min:$lon_min\n";
print "lat_max:$lat_max\n";
print "reso:$reso\n";
print "slope:$slope\n";
$offset =~ s/,.*//; # strip trailing comma and junk
print "offset:$offset\n";
# Read actual lines of data
my $line;
for(my $m=1;$m<=$nline;$m++){
read(STDIN,$line,$npixel*2);
my $x=$npixel*2;
my #values=unpack("S$x",$line);
printf "Line: %d",$m;
for(my $j=0;$j<2;$j++){
printf ",%f",$values[$j]*$slope+$offset;
}
printf "\n"; # newline
}
Save it as go.pl and then in the Terminal, type the following once to make it executable
chmod +x go.pl
and then run it like this
./go.pl < MOD02SSH_A20000224Av6_v601_7200_3601_uvb__le
Sample output extract:
npixel:7200
nline:3601
lon_min:0.00
lat_max:90.00
reso:0.0500
slope:0.10000E-03
offset:0.00000E+00
...
...
Line: 3306,0.099800,0.099800
Line: 3307,0.099900,0.099900
Line: 3308,0.099400,0.074200
Line: 3309,0.098900,0.098900
Line: 3310,0.098400,0.098400
Line: 3311,0.074300,0.074200
Line: 3312,0.071300,0.071200
fortran (f2003 or so) solution. (The linked instructions are awful by the way )
implicit none
character*80 para,outfile
character(len=:),allocatable::header,infile
integer npixel,nline,blen,i
c note kind=2 is not standard. This needs to be a 2-byte integer.
integer(kind=2),allocatable :: data(:,:)
real lon_min,lat_max,reso,slope,off
c header is plain text, so first open formatted and
c directly read header data
infile='MOD02SSH_A20000224Av6_v601_7200_3601_uvb__le'
open(10,file=infile)
read(10,*)npixel,nline,lon_min,lat_max,reso,slope,off,
$ para,outfile
close(10)
write(*,*)npixel,nline,lon_min,lat_max,reso,slope,off,
$ trim(para),' ',trim(outfile)
blen=2*npixel
allocate(character(len=blen)::header)
allocate(data(npixel,nline))
if( sizeof(data(1,1)).ne.2 )then
write(*,*)'error kind=2 did not give a 2 byte integer'
stop
endif
c now close and reopen for binary read.
c direct access approach:
open(20,file=infile,access='direct',recl=blen/4)
c note the granularity of the recl= specifier is not standard.
c ifort uses 4 bytes. (note this will break if npixel is not even )
read(20,rec=1)header
write(*,*)trim(header)
do i=1,nline
read(20,rec=i+1)data(:,i)
enddo
c note streams if available is simpler: (we don't need to know rec len )
c open(20,file=infile,access='stream')
c read(20)header,data
end
This is not actually validated because I don't have known file content to compare against.
I am working with Graphchi's pagerank example: https://github.com/GraphChi/graphchi-cpp/wiki/Example-Apps#pagerank-easy
The example app writes a binary file with vertex information that I would like to read/convert to a plan text file (to later call into R or some other language).
The documentation states that:
"GraphChi will write the values of the edges in a binary file, which is easy to handle in other programs. Name of the file containing vertex values is GRAPH-NAME.4B.vout. Here "4B" refers to the vertex-value being a 4-byte type (float)."
The 'easy to handle' part is what I'm struggling with - I have experience with high level languages but not C++ or dealing with binary files. I have found a few things through searching stackoverflow but no luck yet in reading this file. Ideally this would be done through bash or python.
thanks very much for your help on this.
Update: hexdump graph-name.4B.vout | head -5 gives:
0000000 999a 3e19 7468 3e7f 7d2a 3e93 d8e0 3ec4
0000010 cec6 3fe4 d551 3f08 eff2 3e54 999a 3e19
0000020 999a 3e19 3690 3e8c 0080 3f38 9ea3 3ef5
0000030 b7d6 3f66 999a 3e19 10e3 3ee1 400c 400d
0000040 a3df 3e7c 999a 3e19 979c 3e91 5230 3f18
Here is example code how you can use GraphCHi to write the output out as a string:
https://github.com/GraphChi/graphchi-cpp/wiki/Vertex-Aggregators
But the array is simple byte array. Here is example how to read it in python:
import struct
from array import array as binarray
import sys
inputfile = sys.argv[1]
data = open(inputfile).read()
a = binarray('c')
a.fromstring(data)
s = struct.Struct("f")
l = len(a)
print "%d bytes" %l
n = l / 4
for i in xrange(0, n):
x = s.unpack_from(a, i * 4)[0]
print ("%d %f" % (i, x))
I was having the same trouble. Luckily I work with a bunch of network engineers who helped me out! On Mac Linux, the following command works to print the 4B.vout data one line per node, with the integer values the same as is given in the summary file. If your file is called eg, filename.4B.vout, then some command line perl gets you:
cat filename.4B.vout | LANG= perl -0777 -e '$,=\"\n\"; print unpack(\"L*\",<>),\"\";'
Edited to add: this is for the assignments of connected component ID and community ID, written implicitly the 1st line is the ID of the node labeled 0, the 2nd line is the node labeled 1 etc. But I am copypasting here so I'm not sure how it would need to change for floats. It works great for the integer values per node.
I have an input file that contains a list of ip addresses and the ip_counts(some parameter that I use internally.)The file looks somewhat like this.
202.124.127.26 2135869
202.124.127.25 2111217
202.124.127.17 2058082
202.124.127.16 2014958
202.124.127.20 1949323
202.124.127.24 1933773
202.124.127.27 1932076
202.124.127.22 1886466
202.124.127.18 1882955
202.124.127.21 1803528
202.124.127.23 1786348
119.224.129.200 1776592
119.224.129.211 1639325
202.124.127.19 1479198
119.224.129.201 1145426
202.49.175.110 1133354
119.224.129.210 1119525
68.232.45.132 1085491
119.224.129.209 1015078
131.203.3.8 857951
202.162.73.4 817197
207.123.58.125 785326
202.7.6.18 762603
117.121.253.254 718022
74.125.237.120 710448
68.232.44.219 693002
202.162.73.2 671559
205.128.75.126 611301
119.161.91.17 604393
119.224.129.202 559930
8.27.241.126 528862
74.125.237.152 517516
8.254.9.254 514341
As you can see the ip addresses themselves are unsorted.So I use the sort command on the file to sort the ip addresses as below
cat address_count.txt | sort -t . -k 1,1n -k 2,2n -k 3,3n -k 4,4n > sorted_address.txt
Which gives me an output with ip addresses in the sorted order.The partial output of that file is shown below.
4.23.63.126 15731
4.26.254.254 320705
4.27.8.254 25174
8.12.129.50 176141
8.12.223.125 11800
8.19.32.65 15854
8.19.240.53 11013
8.19.240.70 11915
8.19.240.72 31541
8.19.240.73 23304
8.20.213.28 96434
8.20.213.32 108191
8.20.213.34 170058
8.20.213.39 23512
8.20.213.41 10420
8.20.213.61 24809
8.26.195.253 28568
8.27.152.253 104446
8.27.233.125 115856
8.27.235.126 16102
8.27.235.254 25628
8.27.238.254 108485
8.27.240.125 169262
8.27.241.126 528862
8.27.241.252 197302
8.27.248.125 14926
8.254.9.254 514341
12.129.210.71 89663
15.192.45.21 20139
15.192.45.26 35265
15.193.0.148 10313
15.193.113.29 40318
15.201.49.136 14243
15.240.238.52 57163
17.250.248.95 28166
23.33.125.13 19179
23.33.125.37 17953
31.151.163.60 72709
38.99.42.37 192356
38.99.68.180 41251
38.99.68.181 10272
38.104.237.74 74012
38.108.112.103 37034
38.108.112.115 69698
38.108.112.121 92173
38.108.112.122 99230
38.112.63.238 39958
38.119.130.62 42159
46.4.28.22 19769
Now I want to parse the file given above and convert it to aaa.bbb.ccc.0/8 format and
aaa.bbb.0.0/16 format and I also want to count the number of occurences of the ip's in each subnet.I want to do this using bash.I am open to using sed or awk.How do I achieve this.
For example
8.19.240.53 11013
8.19.240.70 11915
8.19.240.72 31541
8.19.240.73 23304
8.20.213.28 96434
8.20.213.32 108191
8.20.213.34 170058
8.20.213.39 23512
8.20.213.41 10420
8.20.213.61 24809
The about input portion should produce 8.19.240.0/8 and 8.20.213.0/8 and similarly for /16 domains.I also want to count the occurences of machines in the subnet.
For example In the above output this subnet should have the count 4 in the next column beside it.It should also add the already displayed count.i.e (11013 + 11915 + 31541 + 23304) in another column.
8.19.240.0/8 4 (11013 + 11915 + 31541 + 23304)
8.20.213.0/8 6 (96434 + 108191 + 170058 + 23512 + 10420 + 24809)
It would be great if someone could suggest some way to achieve this.
The main problem here is that without having the routing table from the individual moments the packets arrived, you have no idea what netblock they were originally in. Sure, you can put them in the class-full blocks they would be in, in a class-full routing situation, but all that will give you is a nice presentation (and, admittedly, a shorter file).
Furthermore, your example looks a bit broken. You have a bunch of IP addresses in 8.0.0.0/8 and you are aggregating them into what looks like /24 routes and presenting them with a /8 at the end.
Nonetheless, in awk you can use sub() to do text replacement (or you can use index to find occurrences of ., or you can use split to split at dots). It should be relatively easy to go from that to "drop last digit, add the string "0/24" and use that as a key to update an IP-count and a hit-count dictionary, then drop the last two octets and the slash, replace with "0.0/16" and do the same" (all arrays in awk are associative arrays, so essentially dicts). No need to sort in advance, when you loop through the result, you'll get the keys in a random order, but on average there will be fewer of them, so sorting afterwards will be cheaper.
I seem to not have an awk at hand, so I cannot give you a code example.
This might work for you:
awk '{a=$1;sub(/\.[^.]*$/,"",a);ac[a]++;at[a]+=$2};END{for(x in ac)print x".0/8",ac[x],at[x]}' file
This prints the '0/8 addresses to get the 0/16 duplicate the code i.e. b=a;sub(/\.[^.]*$/,"",b);ba[b]++ etc, etc.