Can anybody advise an alternative to importing a couple of
GByte of numeric data (in .mx form) from a list of 60 .mx files, each about 650 MByte?
The - too large to post here - research-problem involved simple statistical operations
with double as much GB of data (around 34) than RAM available (16).
To handle the data size problem I just split things up and used
a Get / Clear strategy to do the math.
It does work, but calling Get["bigfile.mx"] takes quite some time, so I was wondering if it would be quicker to use BLOBs or whatever with PostgreSQL or MySQL or whatever database people use for GB of numeric data.
So my question really is:
What is the most efficient way to handle truly large data set imports in Mathematica?
I have not tried it yet, but I think that SQLImport from DataBaseLink will be slower than Get["bigfile.mx"].
Anyone has some experience to share?
(Sorry if this is not a very specific programming question, but it would really help me to move on with that time-consuming finding-out-what-is-the-best-of-the-137-possibilities-to-tackle-a-problem-in-Mathematica).
Here's an idea:
You said you have a ragged matrix, i.e. a list of lists of different lengths. I'm assuming floating point numbers.
You could flatten the matrix to get a single long packed 1D array (use Developer`ToPackedArray to pack it if necessary), and store the starting indexes of the sublists separately. Then reconstruct the ragged matrix after the data has been imported.
Here's a demonstration that within Mathematica (i.e. after import), extracting the sublists from a big flattened list is fast.
data = RandomReal[1, 10000000];
indexes = Union#RandomInteger[{1, 10000000}, 10000];
ranges = #1 ;; (#2 - 1) & ### Partition[indexes, 2, 1];
data[[#]] & /# ranges; // Timing
{0.093, Null}
Alternatively store a sequence of sublist lengths and use Mr.Wizard's dynamicPartition function which does exactly this. My point is that storing the data in a flat format and partitioning it in-kernel is going to add negligible overhead.
Importing packed arrays as MX files is very fast. I only have 2 GB of memory, so I cannot test on very large files, but the import times are always a fraction of a second for packed arrays on my machine. This will solve the problem that importing data that is not packed can be slower (although as I said in the comments on the main question, I cannot reproduce the kind of extreme slowness you mention).
If BinaryReadList were fast (it isn't as fast as reading MX files now, but it looks like it will be significantly sped up in Mathematica 9), you could store the whole dataset as one big binary file, without the need of breaking it into separate MX files. Then you could import relevant parts of the file like this:
First make a test file:
In[3]:= f = OpenWrite["test.bin", BinaryFormat -> True]
In[4]:= BinaryWrite[f, RandomReal[1, 80000000], "Real64"]; // Timing
Out[4]= {9.547, Null}
In[5]:= Close[f]
Open it:
In[6]:= f = OpenRead["test.bin", BinaryFormat -> True]
In[7]:= StreamPosition[f]
Out[7]= 0
Skip the first 5 million entries:
In[8]:= SetStreamPosition[f, 5000000*8]
Out[8]= 40000000
Read 5 million entries:
In[9]:= BinaryReadList[f, "Real64", 5000000] // Length // Timing
Out[9]= {0.609, 5000000}
Read all the remaining entries:
In[10]:= BinaryReadList[f, "Real64"] // Length // Timing
Out[10]= {7.782, 70000000}
In[11]:= Close[f]
(For comparison, Get usually reads the same data from an MX file in less than 1.5 seconds here. I am on WinXP btw.)
EDIT If you are willing to spend time on this, and write some C code, another idea is to create a library function (using Library Link) that will memory-map the file (link for Windows), and copy it directly into an MTensor object (an MTensor is just a packed Mathematica array, as seen from the C side of Library Link).
I think the two best approaches are either:
1) use Get on the *.mx file,
2) or read in that data and save it in some binary format for which you write a LibraryLink code and then read the stuff via that. That, of course, has the disadvantage that you'd need to convert your MX stuff. But perhaps this is an option.
Generally speaking Get with MX files is pretty fast.
Are sure this is not a swapping problem?
Edit 1:
You could then use also write in an import converter: tutorial/DevelopingAnImportConverter
Related
I'm trying to manipulate sparse binary matrices in GNU Octave, and it's using way more memory than I expect, and relevant sparse-matrix functions don't behave the way I want them to. I see this question about higher-than-expected sparse-matrix storage in MATLAB, which suggests that this matrix should consume even more memory, but helped explain (only) part of this situation.
For a sparse, binary matrix, I can't figure out any way to get Octave to NOT STORE the array of values (they're always implicitly 1, so need not be stored). Can this be done? Octave always seems to consume memory for a values array.
A trimmed-down example demonstrating the situation: create random sparse matrix, turn it into "binary":
mys=spones(sprandn(1024,1024,.03)); nnz(mys), whos mys
Shows the situation. The consumed size is consistent with the storage mechanism outlined in aforementioned SO answer and expanded below, if spones() creates an array of storage-class double and if all indices are 32-bit (i.e., TotalStorageSize - rowIndices - columnIndices == NumNonZero*sizeof(double) -- unnecessarily storing these values (all 1s as doubles) is over half of the total memory consumed by this 3%-sparse object.
After messing with this (for too long) while composing this question, I discovered some partial workarounds, so I'm going to "self-answer" (only) part of the question for continuity (hopefully), but I didn't figure out an adequate answer to main question:
How do I create an efficiently-stored ("no-/implicit-values") binary matrix in Octave?
Additional background on storage format follows...
The Octave docs say the storage format for sparse matrices uses format Compressed Sparse Column (CSC). This seems to imply storing the following arrays (expanding on aforementioned SO answer, with canonical Yale format labels and tweaks for column-major order):
values (A), number-of-nonzeros (NNZ) entries of storage-class size;
row numbers (IA), NNZ entries of index size (hopefully int64 but maybe int32);
start of each column (JA), number-of-columns-plus-1 entries of index size)
In this case, for binary-only storage, I hope there's a way to completely avoid storing array (A), but I can't figure it out.
Full disclosure: As noted above, as I was composing this question, I discovered a workaround to reduce memory usage, so I'm "self-answering" part of this here, but it still isn't fully satisfying, so I'm still listening for a better actual answer to storage of a sparse binary matrix without a trivial, bloated, unnecessary values array...
To get a binary-like value out of a number-like value and reduce the memory usage in this case, use "logical" storage, created by logical(X). For example, building from above,
logicalmys = logical(mys);
creates a sparse bool matrix, that takes up less memory (1-byte logical rather than 8-byte double for the values array).
Adding more information to the whos information using whos_line_format helps illuminate the situation: The default string includes 5 of the 7 properties (see docs for more). I'm using the format string
whos_line_format(" %a:4; %ln:6; %cs:16:6:1; %rb:12; %lc:8; %e:10; %t:20;\n")
to add display of "elements", and "type" (which is distinct from "class").
With that, whos mys logicalmys shows something like
Attr Name Size Bytes Class Elements Type
==== ==== ==== ===== ===== ======== ====
mys 1024x1024 391100 double 32250 sparse matrix
logicalmys 1024x1024 165350 logical 32250 sparse bool matrix
So this shows a distinction between sparse matrix and sparse bool matrix. However, the total memory consumed by logicalmys is consistent with actually storing an array of NNZ booleans (1-byte) -- That is:
totalMemory minus rowIndices minus columnOffsets leaves NNZ bytes left;
in numbers,
165350 - 32250*4 - 1025*4 == 32250.
So we're still storing 32250 elements, all of which are 1. Further, if you set one of the 1-elements to zero, it reduces the reported storage! For a good time, try: pick a nonzero element, e.g., (42,1), then zero it: logicalmys(42,1) = 0; then whos it!
My hope is that this is correct, and that this clarifies some things for those who might be interested. Comments, corrections, or actual answers welcome!
I am using Mathematica 7 to process a large data set. The data set is a three-dimensional array of signed integers. The three levels may be thought of as corresponding to X points per shot, Y shots per scan, and Z scans per set.
I also have a "zeroing" shot (containing X points, which are signed fractions of integers), which I would like to subtract from every shot in the data set. Afterwards, I will never again need the original data set.
How can I perform this transformation without creating new copies of the data set, or parts of it, in the process? Conceptually, the data set is located in memory, and I would like to scan through each element, and change it at that location in memory, without permanently copying it to some other memory location.
The following self-contained code captures all the aspects of what I am trying to do:
(* Create some offsetted data, and a zero data set. *)
myData = Table[Table[Table[RandomInteger[{1, 100}], {k, 500}], {j, 400}], {i, 200}];
myZero = Table[RandomInteger[{1, 9}]/RandomInteger[{1, 9}] + 50, {i, 500}];
(* Method 1 *)
myData = Table[
f1 = myData[[i]];
Table[
f2 = f1[[j]];
f2 - myZero, {j, 400}], {i, 200}];
(* Method 2 *)
Do[
Do[
myData[[i]][[j]] = myData[[i]][[j]] - myZero, {j, 400}], {i, 200}]
(* Method 3 *)
Attributes[Zeroing] = {HoldFirst};
Zeroing[x_] := Module[{},
Do[
Do[
x[[i]][[j]] = x[[i]][[j]] - myZero, {j, Length[x[[1]]]}
], {i, Length[x]}
]
];
(Note: Hat tip to Aaron Honecker for Method #3.)
On my machine (Intel Core2 Duo CPU 3.17 GHz, 4 GB RAM, 32-bit Windows 7), all three methods use roughly 1.25 GB of memory, with #2 and #3 fairing slightly better.
If I don't mind losing precision, wrapping N[ ] around the innards of myData and myZero when they're being created increases their size in memory by 150 MB initially but reduces the amount of memory required for zeroing (by methods #1-#3 above) from 1.25 GB down to just 300 MB! That's my working solution, but it would be great to know the best way of handling this problem.
Unfortunately I have little time now, so I must be concise ...
When working with large data, you need to be aware that Mathematica has a different storage format called packed arrays which is much more compact and much faster than the regular one, but only works for machine reals or integers.
Please evaluate ?Developer`*Packed* to see what functions are available for directly converting to/from them, if this doesn't happen automatically.
So the brief explanation behind why my solution is fast and memory efficient is that it uses packed arrays. I tested using Developer`PackedArrayQ that my arrays never get unpacked, and I used machine reals (I applied N[] to everything)
In[1]:= myData = N#RandomInteger[{1, 100}, {200, 400, 500}];
In[2]:= myZero =
Developer`ToPackedArray#
N#Table[RandomInteger[{1, 9}]/RandomInteger[{1, 9}] + 50, {i, 500}];
In[3]:= myData = Map[# - myZero &, myData, {2}]; // Timing
Out[3]= {1.516, Null}
Also, the operation you were asking for ("I would like to scan through each element, and change it at that location in memory") is called mapping (see Map[] or /#).
Let me start by noting that this answer must be viewed as complementary to the one by #Szabolcs, with the latter being, in my conclusion, the better option. While the solution of #Szabolcs is probably the fastest and best overall, it falls short of the original spec in that Map returns a (modified) copy of the original list, rather than "scan each element, and change it at that location in memory". Such behavior, AFAIK, is only provided by Part command. I will use his ideas (converting everything into packed arrays), to show the code that does in-memory changes to the original list:
In[5]:=
Do[myData[[All,All,i]]=myData[[All,All,i]]- myZero[[i]],
{i,Last#Dimensions#myData}];//Timing
Out[5]= {4.734,Null}
This is conceptually equivalent to method 3 listed in the question, but runs much faster because this is a partly vectorized solution and only a single loop is needed. This is however still at least order of magnitude slower than the solution of #Szabolcs.
In theory, this seems to be a classic speed/memory tradeoff: if you need speed and have some spare memory, #Szabolcs's solution is the way to go. If your memory requirements are tough, in theory this slower method would save on intermediate memory consumption (in the method of #Szabolcs, the original list is garbage-collected after the myData is assigned the result of Map, so the final memory usage is the same, but during the computation, one extra array of the size of myData is maintained by Map).
In practice, however, the memory consumption seems to not be smaller, since an extra copy of the list is for some reason maintained in the Out variable in both cases during (or right after) the computation, even when the output is suppressed (it may also be that this effect does not manifest itself in all cases). I don't quite understand this yet, but my current conclusion is that the method of #Szabolcs is just as good in terms of intermediate memory consumption as the present one based on the in-place list modifications. Therefore, his method seems the way to go in all cases, but I still decided to publish this answer as a complementary.
I've been looking at the ways to check arguments of functions. I noticed that
MatrixQ takes 2 arguments, the second is a test to apply to each element.
But ListQ only takes one argument. (also for some reason, ?ListQ does not have a help page, like ?MatrixQ does).
So, for example, to check that an argument to a function is a matrix of numbers, I write
ClearAll[foo]
foo[a_?(MatrixQ[#, NumberQ] &)] := Module[{}, a + 1]
What would be a good way to do the same for a List? This below only checks that the input is a List
ClearAll[foo]
foo[a_?(ListQ[#] &)] := Module[{}, a + 1]
I could do something like this:
ClearAll[foo]
foo[a_?(ListQ[#] && (And ## Map[NumberQ[#] &, # ]) &)] := Module[{}, a + 1]
so that foo[{1, 2, 3}] will work, but foo[{1, 2, x}] will not (assuming x is a symbol). But it seems to me to be someone complicated way to do this.
Question: Do you know a better way to check that an argument is a list and also check the list content to be Numbers (or of any other Head known to Mathematica?)
And a related question: Any major run-time performance issues with adding such checks to each argument? If so, do you recommend these checks be removed after testing and development is completed so that final program runs faster? (for example, have a version of the code with all the checks in, for the development/testing, and a version without for production).
You might use VectorQ in a way completely analogous to MatrixQ. For example,
f[vector_ /; VectorQ[vector, NumericQ]] := ...
Also note two differences between VectorQ and ListQ:
A plain VectorQ (with no second argument) only gives true if no elements of the list are lists themselves (i.e. only for 1D structures)
VectorQ will handle SparseArrays while ListQ will not
I am not sure about the performance impact of using these in practice, I am very curious about that myself.
Here's a naive benchmark. I am comparing two functions: one that only checks the arguments, but does nothing, and one that adds two vectors (this is a very fast built-in operation, i.e. anything faster than this could be considered negligible). I am using NumericQ which is a more complex (therefore potentially slower) check than NumberQ.
In[2]:= add[a_ /; VectorQ[a, NumericQ], b_ /; VectorQ[b, NumericQ]] :=
a + b
In[3]:= nothing[a_ /; VectorQ[a, NumericQ],
b_ /; VectorQ[b, NumericQ]] := Null
Packed array. It can be verified that the check is constant time (not shown here).
In[4]:= rr = RandomReal[1, 10000000];
In[5]:= Do[add[rr, rr], {10}]; // Timing
Out[5]= {1.906, Null}
In[6]:= Do[nothing[rr, rr], {10}]; // Timing
Out[6]= {0., Null}
Homogeneous non-packed array. The check is linear time, but very fast.
In[7]:= rr2 = Developer`FromPackedArray#RandomInteger[10000, 1000000];
In[8]:= Do[add[rr2, rr2], {10}]; // Timing
Out[8]= {1.75, Null}
In[9]:= Do[nothing[rr2, rr2], {10}]; // Timing
Out[9]= {0.204, Null}
Non-homogeneous non-packed array. The check takes the same time as in the previous example.
In[10]:= rr3 = Join[rr2, {Pi, 1.0}];
In[11]:= Do[add[rr3, rr3], {10}]; // Timing
Out[11]= {5.625, Null}
In[12]:= Do[nothing[rr3, rr3], {10}]; // Timing
Out[12]= {0.282, Null}
Conclusion based on this very simple example:
VectorQ is highly optimized, at least when using common second arguments. It's much faster than e.g. adding two vectors, which itself is a well optimized operation.
For packed arrays VectorQ is constant time.
#Leonid's answer is very relevant too, please see it.
Regarding the performance hit (since your first question has been answered already) - by all means, do the checks, but in your top-level functions (which receive data directly from the user of your functionality. The user can also be another independent module, written by you or someone else). Don't put these checks in all your intermediate functions, since such checks will be duplicate and indeed unjustified.
EDIT
To address the problem of errors in intermediate functions, raised by #Nasser in the comments: there is a very simple technique which allows one to switch pattern-checks on and off in "one click". You can store your patterns in variables inside your package, defined prior to your function definitions.
Here is an example, where f is a top-level function, while g and h are "inner functions". We define two patterns: for the main function and for the inner ones, like so:
Clear[nlPatt,innerNLPatt ];
nlPatt= _?(!VectorQ[#,NumericQ]&);
innerNLPatt = nlPatt;
Now, we define our functions:
ClearAll[f,g,h];
f[vector:nlPatt]:=g[vector]+h[vector];
g[nv:innerNLPatt ]:=nv^2;
h[nv:innerNLPatt ]:=nv^3;
Note that the patterns are substituted inside definitions at definition time, not run-time, so this is exactly equivalent to coding those patterns by hand. Once you test, you just have to change one line: from
innerNLPatt = nlPatt
to
innerNLPatt = _
and reload your package.
A final question is - how do you quickly find errors? I answered that here, in sections "Instead of returning $Failed, one can throw an exception, using Throw.", and "Meta-programming and automation".
END EDIT
I included a brief discussion of this issue in my book here. In that example, the performance hit was on the level of 10% increase of running time, which IMO is borderline acceptable. In the case at hand, the check is simpler and the performance penalty is much less. Generally, for a function which is any computationally-intensive, correctly-written type checks cost only a small fraction of the total run-time.
A few tricks which are good to know:
Pattern-matcher can be very fast, when used syntactically (no Condition or PatternTest present in the pattern).
For example:
randomString[]:=FromCharacterCode#RandomInteger[{97,122},5];
rstest = Table[randomString[],{1000000}];
In[102]:= MatchQ[rstest,{__String}]//Timing
Out[102]= {0.047,True}
In[103]:= MatchQ[rstest,{__?StringQ}]//Timing
Out[103]= {0.234,True}
Just because in the latter case the PatternTest was used, the check is much slower, because evaluator is invoked by the pattern-matcher for every element, while in the first case, everything is purely syntactic and all is done inside the pattern-matcher.
The same is true for unpacked numerical lists (the timing difference is similar). However, for packed numerical lists, MatchQ and other pattern-testing functions don't unpack for certain special patterns, moreover, for some of them the check is instantaneous.
Here is an example:
In[113]:=
test = RandomInteger[100000,1000000];
In[114]:= MatchQ[test,{__?IntegerQ}]//Timing
Out[114]= {0.203,True}
In[115]:= MatchQ[test,{__Integer}]//Timing
Out[115]= {0.,True}
In[116]:= Do[MatchQ[test,{__Integer}],{1000}]//Timing
Out[116]= {0.,Null}
The same, apparently, seems to be true for functions like VectorQ, MatrixQ and ArrayQ with certain predicates (NumericQ) - these tests are extremely efficient.
A lot depends on how you write your test, i.e. to what degree you reuse the efficient Mathematica structures.
For example, we want to test that we have a real numeric matrix:
In[143]:= rm = RandomInteger[10000,{1500,1500}];
Here is the most straight-forward and slow way:
In[144]:= MatrixQ[rm,NumericQ[#]&&Im[#]==0&]//Timing
Out[144]= {4.125,True}
This is better, since we reuse the pattern-matcher better:
In[145]:= MatrixQ[rm,NumericQ]&&FreeQ[rm,Complex]//Timing
Out[145]= {0.204,True}
We did not utilize the packed nature of the matrix however. This is still better:
In[146]:= MatrixQ[rm,NumericQ]&&Total[Abs[Flatten[Im[rm]]]]==0//Timing
Out[146]= {0.047,True}
However, this is not the end. The following one is near instantaneous:
In[147]:= MatrixQ[rm,NumericQ]&&Re[rm]==rm//Timing
Out[147]= {0.,True}
Since ListQ just checks that the head is List, the following is a simple solution:
foo[a:{___?NumberQ}] := Module[{}, a + 1]
In Mathematica as in other systems of computer math the numbers are internally stored in binary form. However when exporting them with such functions as Put and PutAppend they are converted into approximate decimals. When you import them back with such functions as Get they are restored from this approximate decimal representation to binary form.
The question is whether the recovered number is always identical to the original binary number and, if not always, in which cases it is not and how large can be the difference? I am particularly interested in the Put - Get cycle (on the same computer system).
The following two simple experiments show that probably the Put - Get cycle in Mathematica always restores original numbers exactly even for arbitrary precision numbers:
In[1]:= list=RandomReal[{-10^6,10^6},10000];
Put[list,"test.txt"];
list2=Get["test.txt"];
Order[list,list2]===0
Order[Total#Abs[list-list2],0.]===0
Out[4]= True
Out[5]= True
In[6]:= list=SetPrecision[RandomReal[{-10^6,10^6},10000],50];
Put[list,"test.txt"];
list2=Get["test.txt"];
Order[list,list2]===0
Total#Abs[list-list2]//InputForm
Out[9]= True
Out[10]//InputForm=
0``39.999515496936205
But maybe I am missing something?
UPDATE
With more correct test code I have found that in reality these tests show only that restored numbers have identical binary RealDigits but their Precisions may differ even in Equal sense. Here are more correct tests:
test := (Put[list, "test.txt"];
list2 = Get["test.txt"];
{Order[list, list2] === 0,
Order[Total#Abs[list - list2], 0.] === 0,
Total[Order ### RealDigits[Transpose[{list, list2}], 2]],
Total[Order ### Map[Precision, Transpose[{list, list2}], {-1}]],
Total[1 - Boole[Equal ### Map[Precision, Transpose[{list, list2}], {-1}]]]})
In[8]:= list=RandomReal[NormalDistribution[],10000]^1001;
test
Out[9]= {False,True,0,1,3}
In[6]:= list=RandomReal[NormalDistribution[],10000,WorkingPrecision->50]^1001;
test
Out[7]= {False,False,0,-2174,1}
I'm afraid I can't give a definitive answer. If you look into the text file you see it's stored as something like the InputForm of the values, including the precision indication for non-machine precision numbers.
Assuming that Get uses the same conversion routines as ImportString and ExportString your test can be sped up a tiny bit.
Monitor[
Do[
i = RandomReal[{$MinMachineNumber, 10 $MinMachineNumber}, 100000];
If[i =!=
ToExpression[ImportString[ExportString[i, "Text"], "List"]],
Print[i]], {n, 100}
],
n]
I have tested this for several hundreds of millions of numbers in various ranges between $MinMachineNumber and $MaxMachineNumber and I always get back the original numbers. It's no proof, of course, but it seems unlikely that you're going to see numbers for which this is not true if there are any (and in that case the difference would be so tiny as to be negligible).
One important thing to know is that Put[] / Get[] doesn't keep packed arrays packed. You should check out DumpSave[]. It's much faster as it's a binary format and keeps arrays packed.
I am a beginner trying to do some engineering experiments using fortran 77. I am using Force 2.0 compiler and editor. I have the following queries:
How can I generate a random number between a specified range, e.g. if I need to generate a single random number between 3.0 and 10.0, how can I do that?
How can I use the data from a text file to be called in calculations in my program. e.g I have temperature, pressure and humidity values (hourly values for a day, so total 24 values in each text file).
Do I also need to define in the program how many values are there in the text file?
Knuth has released into the public domain sources in both C and FORTRAN for the pseudo-random number generator described in section 3.6 of The Art of Computer Programming.
2nd question:
If your file, for example, looks like:
hour temperature pressure humidity
00 15 101325 60
01 15 101325 60
... 24 of them, for each hour one
this simple program will read it:
implicit none
integer hour, temp, hum
real p
character(80) junkline
open(unit=1, file='name_of_file.dat', status='old')
rewind(1)
read(1,*)junkline
do 10 i=1,24
read(1,*)hour,temp,p,hum
C do something here ...
10 end
close(1)
end
(the indent is a little screwed up, but I don't know how to set it right in this weird environment)
My advice: read up on data types (INTEGER, REAL, CHARACTER), arrays (DIMENSION), input/output (READ, WRITE, OPEN, CLOSE, REWIND), and loops (DO, FOR), and you'll be doing useful stuff in no time.
I never did anything with random numbers, so I cannot help you there, but I think there are some intrinsic functions in fortran for that. I'll check it out, and report tomorrow. As for the 3rd question, I'm not sure what you ment (you don't know how many lines of data you'll be having in a file ? or ?)
You'll want to check your compiler manual for the specific random number generator function, but chances are it generates random numbers between 0 and 1. This is easy to handle - you just scale the interval to be the proper width, then shift it to match the proper starting point: i.e. to map r in [0, 1] to s in [a, b], use s = r*(b-a) + a, where r is the value you got from your random number generator and s is a random value in the range you want.
Idigas's answer covers your second question well - read in data using formatted input, then use them as you would any other variable.
For your third question, you will need to define how many lines there are in the text file only if you want to do something with all of them - if you're looking at reading the line, processing it, then moving on, you can get by without knowing the number of lines ahead of time. However, if you are looking to store all the values in the file (e.g. having arrays of temperature, humidity, and pressure so you can compute vapor pressure statistics), you'll need to set up storage somehow. Typically in FORTRAN 77, this is done by pre-allocating an array of a size larger than you think you'll need, but this can quickly become problematic. Is there any chance of switching to Fortran 90? The updated version has much better facilities for dealing with standardized dynamic memory allocation, not to mention many other advantages. I would strongly recommend using F90 if at all possible - you will make your life much easier.
Another option, depending on the type of processing you're doing, would be to investigate algorithms that use only single passes through data, so you won't need to store everything to compute things like means and standard deviations, for example.
This subroutine generate a random number in fortran 77 between 0 and ifin
where i is the seed; some great number such as 746397923
subroutine rnd001(xi,i,ifin)
integer*4 i,ifin
real*8 xi
i=i*54891
xi=i*2.328306e-10+0.5D00
xi=xi*ifin
return
end
You may modifies in order to take a certain range.