Mathematica running out of memory

I'm trying to run the following program, which calculates roots of polynomials of degree up to d with coefficients only +1 or -1, and then stores the results in files.
d = 20; n = 18000;
f[z_, i_] := Sum[(2 Mod[Floor[(i - 1)/2^k], 2] - 1) z^(d - k), {k, 0, d}];
Here f[z,i] gives a polynomial in z whose plus or minus signs are chosen by counting i in binary. For d=2, we would have
f[z,1] = -z^2 - z - 1
f[z,2] = -z^2 - z + 1
f[z,3] = -z^2 + z - 1
f[z,4] = -z^2 + z + 1
DistributeDefinitions[d, n, f]
ParallelDo[
 Do[
  root = N[Root[f[z, i], j]];
  {a, b} = Round[n ({Re[root], Im[root]}/1.5 + 1)/2],
  (* the rounded grid point {a, b} is then saved to a file here; that code is omitted *)
  {i, 1, 2^d}],
 {j, 1, d}]
I realise reading this probably isn't too enjoyable, but it's relatively short anyway. I would have tried to cut it down to the relevant parts, but here I really have no clue what the trouble is. I'm calculating all roots of f[z,i], rounding them so that they correspond to points on an n by n grid, and saving that data in various files.
For some reason, the memory usage in Mathematica creeps up until it fills all the memory (6 GB on this machine); then the computation continues extremely slowly; why is this?
I am not sure what is using up the memory here - my only guess was that the file streams were using up memory, but that's not the case: I tried appending data to 2 GB files and there was no noticeable memory usage for that. There seems to be absolutely no reason for Mathematica to be using large amounts of memory here.
For small values of d (15 for example), the behaviour is the following: I have 4 kernels running. As they all run through the ParallelDo loop (each doing a value of j at a time), the memory use increases, until they all finish going through that loop once. Then the next times they go through that loop, the memory use does not increase at all. The calculation eventually finishes and everything is fine.
Also, quite importantly, once the calculation stops, the memory use does not go back down.
If I start another calculation, the following happens:
-If the previous calculation stopped when memory use was still increasing, it continues to increase (it might take a while to start increasing again, basically to get to the same point in the computation).
-If the previous calculation stopped when memory use was not increasing, it does not increase further.
Edit: The issue seems to come from the relative complexity of f - changing it into some easier polynomial seems to fix the issue. I thought the problem might be that Mathematica remembers f[z,i] for specific values of i, but setting f[z,i] :=. just after calculating a root of f[z,i] complains that the assignment did not exist in the first place, and the memory is still used.
It's quite puzzling really, as f is the only remaining thing I can imagine taking up memory, but defining f in the inner Do loop and clearing it each time after a root is calculated does not solve the problem.

Ouch, this is a nasty one.
What's going on is that N will cache results in order to speed up future calculations if you need them again. Sometimes this is absolutely what you want, but sometimes it just breaks the world. Fortunately, you do have some options. One is to use the ClearSystemCache command, which does just what it says on the tin. After I ran your un-parallelized loop for a little while (before getting bored and aborting the calculation), MemoryInUse reported ~160 MiB in use. Using ClearSystemCache got that down to about 14 MiB.
One thing you should look at doing, instead of calling ClearSystemCache programmatically, is to use SetSystemOptions to change the caching behavior. You should take a look at SystemOptions["CacheOptions"] to see what the possibilities are.
EDIT: It's not terribly surprising that the caching causes a bigger problem for more complex expressions. It's got to be stashing copies of those expressions somewhere, and more complex expressions require more memory.

Related

Why are my nested for loops taking so long to compute?

I have a code that generates all of the possible combinations of 4 integers between 0 and 36.
This will be 37^4 numbers = 1874161.
My code is written in MATLAB:
i = 0;
for a = 0:36
    for b = 0:36
        for c = 0:36
            for d = 0:36
                i = i + 1;
                combination(i,:) = [a, b, c, d];
            end
        end
    end
end
I've tested this with using the number 3 instead of the number 36 and it worked fine.
If there are 1874161 combinations, and with an overly cautious guess of 100 clock cycles to do the additions and write the values, then on a 2.3 GHz PC this is:
1874161 * (1/2300000000) * 100 = 0.08148526086
A fraction of a second. But it has been running for about half an hour so far.
I did receive a warning that combination changes size on every loop iteration and that I should consider preallocating it for speed, but that can't affect it that much, can it?
As @horchler suggested, you need to preallocate the target array.
This is because without preallocation your program is no longer O(N^4). Each time you add a new row to the array it has to be resized: a new, bigger array is created (since MATLAB does not know how big the array will eventually become, it may well grow it by only one row at a time), the old array is copied into it, and then the old array is deleted. So when you have 10 rows in the array and add the 11th, a copy of 10 rows is added to that iteration. If I am not mistaken, the total copying cost is therefore roughly
1 + 2 + 3 + ... + N^4 = ((N^4)^2 + N^4)/2
which is on the order of N^8, massively more work than the N^4 you estimated.
The reallocations also keep growing in size, so as i crosses each CPU cache size barrier the copies slow down even further.
The only solution to this without preallocation is to store the result in a linked list.
I am not sure MATLAB has that option, but it would need one or two pointers per item (32/64-bit values), which makes your data 2+ times bigger.
If you need even more speed then there are ways (probably not for MATLAB):
use multi-threading, as the array filling is fully parallelisable
use block memory copies (rep movsd) or DMA, as the data repeats periodically
You can also consider computing the values from i on the fly instead of storing the whole array; depending on the usage this can be faster in some cases (a sketch of this idea follows below).
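For illustration, here is a minimal Python sketch of that last idea (the question uses MATLAB, but the arithmetic is language-agnostic): the i-th combination can be computed directly from the index by extracting base-37 digits, so no large array has to be stored at all.

def combination_from_index(i, base=37, digits=4):
    # Interpret the 0-based index i as a base-37 number and return its
    # digits [a, b, c, d], most significant first.
    combo = []
    for _ in range(digits):
        i, r = divmod(i, base)
        combo.append(r)
    return combo[::-1]

# Example: index 0 -> [0, 0, 0, 0], index 37 -> [0, 0, 1, 0]
print(combination_from_index(37))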

GPGPU computation with MATLAB does not scale properly

I've been experimenting with the GPU support of Matlab (v2014a). The notebook I'm using to test my code has an NVIDIA 840M built in.
Since I'm new to GPU computing with Matlab, I started out with a few simple examples and observed some strange scaling behaviour. Instead of increasing the size of my problem, I simply put a for loop around my computation. I expected the time for the computation to scale with the number of iterations, since the problem size itself does not increase. This was also true for smaller numbers of iterations; however, at a certain point the time does not scale as expected, and instead I observe a huge increase in computation time. Afterwards, the problem continues to scale as expected again.
The code example started from a random walk simulation, but I tried to produce an example that is easy and still shows the problem.
Here's what my code does. I initialize two matrices as sin(alpha) and cos(alpha). Then I loop over the number of iterations from 2^1 to 2^15. In each case I repeat the computation sin(alpha)^2 and cos(alpha)^2 and add them up (this was just to check the result), performing this calculation as many times as the iteration count suggests.
function gpu_scale
close all
NP = 10000;
NT = 1000;
ALPHA = rand(NP, NT, 'single') * 2 * pi;
SINALPHA = sin(ALPHA);
COSALPHA = cos(ALPHA);
GSINALPHA = gpuArray(SINALPHA); % move array to gpu
GCOSALPHA = gpuArray(COSALPHA);
PMAX = 15;
for P = 1:PMAX
    for i = 1:2^P
        GX = GSINALPHA.^2;
        GY = GCOSALPHA.^2;
        GZ = GX + GY;
    end
end
The following plot shows the computation time in a log-log plot for the case where I always double the number of iterations. The jump occurs when doubling from 1024 to 2048 iterations.
The initial bump for two iterations might be due to initialization and is not really relevant anyhow.
I see no reason for the jump between 2^10 and 2^11 iterations, since the computation time should only depend on the number of iterations.
My question: Can somebody explain this behavior to me? What is happening on the software/hardware side, that explains this jump?
Thanks in advance!
EDIT: As suggested by Divakar, I changed the way I time my code. I wasn't sure I was using gputimeit correctly; however, MathWorks suggests another possible way, namely
gd= gpuDevice();
tic
% the computation
wait(gd);
Time = toc;
Using this way to measure my performance, the times are significantly longer; however, I don't observe the jump from the previous plot. I added the CPU performance for comparison and kept both timings for the GPU (wait / no wait), which can be seen in the following plot.
It seems that the observed jump "corrects" the timing towards the case where I used wait. If I understand the problem correctly, the good performance in the no-wait case is due to the fact that we do not wait for the GPU to finish completely. However, I still don't see an explanation for the jump.
Any ideas?

Transform a nested list without copying or losing precision

I am using Mathematica 7 to process a large data set. The data set is a three-dimensional array of signed integers. The three levels may be thought of as corresponding to X points per shot, Y shots per scan, and Z scans per set.
I also have a "zeroing" shot (containing X points, which are signed fractions of integers), which I would like to subtract from every shot in the data set. Afterwards, I will never again need the original data set.
How can I perform this transformation without creating new copies of the data set, or parts of it, in the process? Conceptually, the data set is located in memory, and I would like to scan through each element, and change it at that location in memory, without permanently copying it to some other memory location.
The following self-contained code captures all the aspects of what I am trying to do:
(* Create some offsetted data, and a zero data set. *)
myData = Table[Table[Table[RandomInteger[{1, 100}], {k, 500}], {j, 400}], {i, 200}];
myZero = Table[RandomInteger[{1, 9}]/RandomInteger[{1, 9}] + 50, {i, 500}];
(* Method 1 *)
myData = Table[
   f1 = myData[[i]];
   Table[
    f2 = f1[[j]];
    f2 - myZero, {j, 400}], {i, 200}];
(* Method 2 *)
Do[
 Do[
  myData[[i]][[j]] = myData[[i]][[j]] - myZero, {j, 400}], {i, 200}]
(* Method 3 *)
Attributes[Zeroing] = {HoldFirst};
Zeroing[x_] := Module[{},
   Do[
    Do[
     x[[i]][[j]] = x[[i]][[j]] - myZero, {j, Length[x[[1]]]}
     ], {i, Length[x]}
    ]
   ];
(Note: Hat tip to Aaron Honecker for Method #3.)
On my machine (Intel Core2 Duo CPU 3.17 GHz, 4 GB RAM, 32-bit Windows 7), all three methods use roughly 1.25 GB of memory, with #2 and #3 faring slightly better.
If I don't mind losing precision, wrapping N[ ] around the innards of myData and myZero when they're being created increases their size in memory by 150 MB initially but reduces the amount of memory required for zeroing (by methods #1-#3 above) from 1.25 GB down to just 300 MB! That's my working solution, but it would be great to know the best way of handling this problem.
Unfortunately I have little time now, so I must be concise ...
When working with large data, you need to be aware that Mathematica has a different storage format called packed arrays which is much more compact and much faster than the regular one, but only works for machine reals or integers.
Please evaluate ?Developer`*Packed* to see what functions are available for directly converting to/from them, if this doesn't happen automatically.
So the brief explanation behind why my solution is fast and memory efficient is that it uses packed arrays. I checked with Developer`PackedArrayQ that my arrays never get unpacked, and I used machine reals (I applied N[] to everything).
In[1]:= myData = N@RandomInteger[{1, 100}, {200, 400, 500}];
In[2]:= myZero =
  Developer`ToPackedArray@
   N@Table[RandomInteger[{1, 9}]/RandomInteger[{1, 9}] + 50, {i, 500}];
In[3]:= myData = Map[# - myZero &, myData, {2}]; // Timing
Out[3]= {1.516, Null}
Also, the operation you were asking for ("I would like to scan through each element, and change it at that location in memory") is called mapping (see Map[] or /@).
Let me start by noting that this answer must be viewed as complementary to the one by @Szabolcs, with the latter being, in my conclusion, the better option. While the solution of @Szabolcs is probably the fastest and best overall, it falls short of the original spec in that Map returns a (modified) copy of the original list, rather than "scan each element, and change it at that location in memory". Such behavior, AFAIK, is only provided by the Part command. I will use his ideas (converting everything into packed arrays), to show the code that does in-memory changes to the original list:
In[5]:=
Do[myData[[All, All, i]] = myData[[All, All, i]] - myZero[[i]],
  {i, Last@Dimensions@myData}]; // Timing
Out[5]= {4.734,Null}
This is conceptually equivalent to method 3 listed in the question, but runs much faster because it is a partly vectorized solution and only a single loop is needed. It is, however, still at least an order of magnitude slower than the solution of @Szabolcs.
In theory, this seems to be a classic speed/memory tradeoff: if you need speed and have some spare memory, @Szabolcs's solution is the way to go. If your memory requirements are tough, in theory this slower method would save on intermediate memory consumption (in the method of @Szabolcs, the original list is garbage-collected after myData is assigned the result of Map, so the final memory usage is the same, but during the computation one extra array of the size of myData is maintained by Map).
In practice, however, the memory consumption seems to not be smaller, since an extra copy of the list is for some reason maintained in the Out variable in both cases during (or right after) the computation, even when the output is suppressed (it may also be that this effect does not manifest itself in all cases). I don't quite understand this yet, but my current conclusion is that the method of @Szabolcs is just as good in terms of intermediate memory consumption as the present one based on in-place list modifications. Therefore, his method seems the way to go in all cases, but I still decided to publish this answer as a complement.

Finding the optimum file size combination

This is a problem I would think there is an algorithm for already - but I do not know the right words to use with google it seems :).
The problem: I would like to make a little program with which I would select a directory containing any files (but for my purpose media files, audio and video). After that I would like to enter in MB the maximum total file size sum that must not be exceeded. At this point you would hit a "Calculate best fit" button.
This button should compare all the files in the directory and produce a list of the files that, when put together, come as close as possible to the maximum total file size without going over the limit.
This way you could find out which files to combine when burning a CD or DVD so that you will be able to use as much as possible of the disc.
I've tried to come up with the algorithm for this myself - but failed :(.
Anyone know of some nice algorithm for doing this?
Thanks in advance :)
Just for fun I tried out the accurate dynamic programming solution. Written in Python, because of my supreme confidence that you shouldn't optimise until you have to ;-)
This could provide either a start, or else a rough idea of how close you can get before resorting to approximation.
Code based on http://en.wikipedia.org/wiki/Knapsack_problem#0-1_knapsack_problem, hence the less-than-informative variable names m, W, w, v.
#!/usr/bin/python
import sys

solcount = 0

class Solution(object):
    def __init__(self, items):
        object.__init__(self)
        #self.items = items
        self.value = sum(items)
        global solcount
        solcount += 1
    def __str__(self):
        #return str(self.items) + ' = ' + str(self.value)
        return ' = ' + str(self.value)

m = {}

def compute(v, w):
    coord = (len(v), w)
    if coord in m:
        return m[coord]
    if len(v) == 0 or w == 0:
        m[coord] = Solution([])
        return m[coord]
    newvalue = v[0]
    newarray = v[1:]
    notused = compute(newarray, w)
    if newvalue > w:
        m[coord] = notused
        return notused
    # used = Solution(compute(newarray, w - newvalue).items + [newvalue])
    used = Solution([compute(newarray, w - newvalue).value] + [newvalue])
    best = notused if notused.value >= used.value else used
    m[coord] = best
    return best

def main():
    v = [int(l) for l in open('filesizes.txt')]
    W = int(sys.argv[1])
    print len(v), "items, limit is", W
    print compute(v, W)
    print solcount, "solutions computed"

if __name__ == '__main__':
    main()
For simplicity I'm just considering the file sizes: once you have the list of sizes that you want to use, you can find some filenames with those sizes by searching through a list, so there's no point tangling up filenames in the core, slow part of the program. I'm also expressing everything in multiples of the block size.
As you can see, I've commented out the code that gives the actual solution (as opposed to the value of the solution). That was to save memory - the proper way to store the list of files used isn't one list in each Solution, it's to have each solution point back to the Solution it was derived from. You can then calculate the list of filesizes at the end by going back through the chain, outputting the difference between the values at each step.
With a list of 100 randomly-generated file sizes in the range 2000-6000 (I'm assuming 2k blocks, so that's files of size 4-12MB), this solves for W=40K in 100 seconds on my laptop. In doing so it computes 2.6M of a possible 4M solutions.
Complexity is O(W*n), where n is the number of files. This does not contradict the fact that the problem is NP-complete: the algorithm is pseudo-polynomial, since W can be exponential in the number of bits needed to write it down. So I am at least approaching a solution, and this is just in unoptimised Python.
Clearly some optimisation is now required, because actually it needs to be solved for W=4M (8GB DVD) and however many files you have (let's say a few thousand). Presuming that the program is allowed to take 15 minutes (comparable to the time required to write a DVD), that means performance is currently short by a factor of roughly 10^3. So we have a problem that's quite hard to solve quickly and accurately on a PC, but not beyond the bounds of technology.
Memory use is the main concern, since once we start hitting swap we'll slow down, and if we run out of virtual address space we're in real trouble because we have to implement our own storage of Solutions on disk. My test run peaks at 600MB. If you wrote the code in C on a 32-bit machine, each "solution" has a fixed size of 8 bytes. You could therefore generate a massive 2-D array of them without doing any memory allocation in the loop, but in 2GB of RAM you could only handle W=4M and n=67. Oops - DVDs are out. It could very nearly solve for 2-k blocksize CDs, though: W=350k gives n=766.
Edit: MAK's suggestion to compute iteratively bottom-up, rather than recursively top-down, should massively reduce the memory requirement. First calculate m(1,w) for all 0 <= w <= W. From this array, you can calculate m(2,w) for all 0 <= w <= W. Then you can throw away all the m(1,w) values: you won't need them to calculate m(3,w) etc.
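For concreteness, here is a rough Python sketch of that bottom-up idea, keeping only a single row of the table (variable names roughly follow the code above, and the filesizes.txt input is reused). Like the row-discarding scheme, it gives you the optimal value but not directly the list of files, which has to be recovered separately.

def knapsack_bottom_up(sizes, W):
    # best[w] = largest total size achievable so far without exceeding w
    best = [0] * (W + 1)
    for size in sizes:
        # iterate w downwards so each file is used at most once
        for w in range(W, size - 1, -1):
            if best[w - size] + size > best[w]:
                best[w] = best[w - size] + size
    return best[W]

# Usage, mirroring main() above:
# sizes = [int(l) for l in open('filesizes.txt')]
# print knapsack_bottom_up(sizes, 40000)

Memory use is then O(W) rather than O(W*n).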
By the way, I suspect that actually the problem you want to solve might be the bin packing problem, rather than just the question of how to get as close as possible to filling a DVD. That's the case if you have a bunch of files and you want to write them all to DVD, using as few DVDs as possible. There are situations where solving the bin packing problem is very easy, but solving this problem is hard. For example, suppose that you have 8GB disks and 15GB of small files. It's going to take some searching to find the closest possible match to 8GB, but the bin-packing problem would be trivially solved just by putting roughly half the files on each disk - it doesn't matter exactly how you divide them because you're going to waste 1GB of space whatever you do.
All that said, there are extremely fast heuristics that give decent results much of the time. Simplest is to go through the list of files (perhaps in decreasing order of size), and include each file if it fits, exclude it otherwise. You only need to fall back to anything slow if fast approximate solutions aren't "good enough", for your choice of "enough".
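A minimal Python sketch of that largest-first greedy heuristic (sizes is assumed to be a list of file sizes in the same units as the capacity):

def greedy_fill(sizes, capacity):
    # take files largest-first, including each one only if it still fits
    chosen, remaining = [], capacity
    for size in sorted(sizes, reverse=True):
        if size <= remaining:
            chosen.append(size)
            remaining -= size
    return chosen, capacity - remaining  # selected sizes and their total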
This is, as others pointed out, the Knapsack Problem, which is a combinatorial optimization problem. It means that you look for some subset or permutation of a set which minimizes (or maximizes) a certain cost. Another well-known such problem is the Traveling Salesman Problem.
Such problems are usually very hard to solve. But if you're interested in almost optimal solutions, you can use non-deterministic algorithms, like simulated annealing. You most likely won't get the optimal solution, but a nearly optimal one.
This link explains how simulated annealing can solve the Knapsack Problem, and therefore should be interesting to you.
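To make the idea concrete, here is a toy Python sketch of simulated annealing for the "fill the disc as full as possible" version of the problem (the cooling schedule and acceptance rule are arbitrary, untuned choices, not taken from the linked article):

import math, random

def anneal_fill(sizes, capacity, steps=100000):
    # State: a set of chosen file indices. The total is never allowed to
    # exceed the capacity; worse moves are accepted with a probability
    # that shrinks as the temperature drops.
    chosen, total = set(), 0
    best, best_total = set(), 0
    for step in range(steps):
        temp = max(1e-9, 1.0 - float(step) / steps)  # linear cooling
        i = random.randrange(len(sizes))             # toggle one random file
        delta = -sizes[i] if i in chosen else sizes[i]
        if total + delta > capacity:
            continue
        if delta >= 0 or random.random() < math.exp(delta / (temp * capacity)):
            chosen ^= {i}
            total += delta
            if total > best_total:
                best, best_total = set(chosen), total
    return best, best_total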
Sounds like you have a hard problem there. This problem is well known, but no efficient solutions (can?) exist.
Other than the obvious way of trying all permutations of objects with size < bucket, you could also have a look at the implementation of the bucketizer Perl module, which does exactly what you are asking for. I'm not sure what it does exactly, but the manual mentions that there is one "brute force" way, so I'm assuming there must also be some kind of optimization.
Thank you for your answers.
I looked into this problem more now with the guidance of the given answers. Among other things I found this webpage, http://www.mathmaniacs.org/lessons/C-subsetsum/index.html. It tells about the subset sum problem, which I believe is the problem I described here.
One sentence from the webpage is this:
--
You may want to point out that a number like 2^300 is so large that even a computer counting at a speed of over a million or billion each second would not reach 2^300 until long after our sun had burned out.
--
Personally I would have more use for this algorithm when comparing a larger number of file sizes than, say, 10 or fewer, since it is fairly easy to reach (probably) the biggest sum just by manual trial and error when the number of files is low.
A CD of MP3s can easily hold 100 files, and a DVD a lot more, which leads to the sun burning out before I have the answer :).
Randomly trying to find the optimum sum can apparently get you pretty close, but it can never be guaranteed to be the optimum answer, and with bad luck it can also be quite far off. Brute force seems to be the only real way to get the optimum answer, and that would take way too long.
So I guess I just continue estimating manually a good combination of files to burn on CDs and DVDs. :)
Thanks for the help. :)
If you're looking for a reasonable heuristic, and the objective is to minimize the number of disks required, here's a simple one you might consider. It's similar to one I used recently for a job-shop problem. I was able to compare it to known optima, and found it provided allocations that were either optimal or extremely close to being optimal.
Suppose B is the size of all files combined and C is the capacity of each disk. Then you will need at least n = roundup(B/C) disks. Try to fit all the files on n disks. If you are able to do so, you're finished, and have an optimal solution. Otherwise, try to fit all the files on n+1 disks. If you are able to do so, you have a heuristic solution; otherwise try to fit the files on n+2 disks, and so on, until you are able to do so.
For any given allocation of files to disks below (which may exceed some disk capacities), let s_i be the combined size of the files allocated to disk i, and let t = max s_i. We are finished when t <= C.
First, order (and index) the files largest to smallest.
For m >= n disks,
Allocate the files to the disks in a back-and-forth way: 1->1, 2->2, ... m->m, m+1->m-1, m+2->m-2, ... 2m->1, 2m+1->2, 2m+2->3, ... 3m->m, 3m+1->m-1, and so on until all files are allocated, with no regard to disk capacity (a sketch of this step follows after this list). If t <= C we are finished (and the allocation is optimal if m = n); else go to #2.
Attempt to reduce t by moving a file from a disk i with si = t to another disk, without increasing t. Continue doing this until t <= C, in which case we are finished (and the allocation is optimal if m = n), or t cannot be reduced further, in which case go to #3.
Attempt to reduce t by performing pairwise exchanges between disks. Continue doing this until t <= C, in which case we are finished (and the allocation is optimal if m = n), or t cannot be reduced further with pairwise exchanges. In the latter case, repeat #2, unless no improvement was made the last time #2 was repeated, in which case increment m by one and repeat #1.
In #2 and #3 there are of course different ways to order possible reallocations and pairwise exchanges.
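As an illustration of step #1, here is a rough Python sketch of one plausible reading of the back-and-forth assignment (a plain snake order over the m disks; the exact zigzag described above may differ slightly at the turning points):

def snake_allocate(sizes, m):
    # assign files, largest first, to m disks in snake order:
    # disks 1..m, then m..1, then 1..m, and so on
    loads = [0] * m
    order = list(range(m)) + list(range(m - 1, -1, -1))  # one round trip
    for k, size in enumerate(sorted(sizes, reverse=True)):
        loads[order[k % len(order)]] += size
    return loads, max(loads)  # per-disk totals and t = max s_i

# Steps #2 and #3 would then try to reduce t = max(loads) by single-file
# moves and pairwise exchanges until t fits within the capacity C.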

Best way to calculate the result of a formula?

I currently have an application which can contain 100s of user defined formulae. Currently, I use reverse polish notation to perform the calculations (pushing values and variables on to a stack, then popping them off the stack and evaluating). What would be the best way to start parallelizing this process? Should I be looking at a functional language?
The calculations are performed on arrays of numbers, so for example a simple A+B could actually mean hundreds of additions. I'm currently using Delphi, but this is not a requirement going forward; I'll use the tool most suited to the job. Formulae may also depend on each other, so we may have one formula C=A+B and a second one D=C+A, for example.
Let's assume your formulae (equations) are not cyclic, as otherwise you cannot "just" evaluate them. If you have vectorized equations like A = B + C where A, B and C are arrays, let's conceptually split them into equations on the components, so that if the array size is 5, this equation is split into
a1 = b1 + c1
a2 = b2 + c2
...
a5 = b5 + c5
Now assuming this, you have a large set of equations on simple quantities (whether integer, rational or something else).
If you have two equations E and F, let's say that F depends_on E if the right-hand side of F mentions the left-hand side of E, for example
E: a = b + c
F: q = 2*a + y
Now to get towards how to calculate this, you could always use randomized iteration to solve this (this is just an intermediate step in the explanation), following this algorithm:
1 while (there is at least one equation which has not been computed yet)
2     select one such pending equation E so that:
3         for every equation D such that E depends_on D:
4             D has been already computed
5     calculate the left-hand side of E
This process terminates with the correct answer regardless of how you make your selection on line 2. Now the cool thing is that it also parallelizes easily. You can run it in an arbitrary number of threads! What you need is a concurrency-safe queue which holds those equations whose prerequisites (the equations they depend on) have been computed but which have not been computed themselves yet. Every thread pops one equation off this queue at a time (thread-safely), calculates the answer, then checks whether any new equations now have all their prerequisites computed, and adds those equations (thread-safely) to the work queue. Done.
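Here is a compact Python sketch of that work-queue idea, using a thread pool in place of "an arbitrary number of threads" (the equation names, dependency lists and constant values are made up for illustration, echoing the a = b + c, q = 2*a + y example above):

from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED
import threading

# Each equation: (names it depends on, function computing it from those values)
equations = {
    'b': ([],         lambda env: 2),
    'c': ([],         lambda env: 3),
    'y': ([],         lambda env: 10),
    'a': (['b', 'c'], lambda env: env['b'] + env['c']),      # a = b + c
    'q': (['a', 'y'], lambda env: 2 * env['a'] + env['y']),  # q = 2*a + y
}

def evaluate_all(equations, workers=4):
    deps_left = {name: len(deps) for name, (deps, _) in equations.items()}
    dependents = defaultdict(list)
    for name, (deps, _) in equations.items():
        for d in deps:
            dependents[d].append(name)
    env, lock = {}, threading.Lock()

    def run(name):
        # all prerequisites are already in env by construction
        env[name] = equations[name][1](env)
        ready = []
        with lock:
            for nxt in dependents[name]:
                deps_left[nxt] -= 1
                if deps_left[nxt] == 0:
                    ready.append(nxt)
        return ready  # equations whose prerequisites are now all computed

    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = [n for n, k in deps_left.items() if k == 0]
        futures = {pool.submit(run, n) for n in pending}
        while futures:
            done, futures = wait(futures, return_when=FIRST_COMPLETED)
            for fut in done:
                for nxt in fut.result():
                    futures.add(pool.submit(run, nxt))
    return env

print(evaluate_all(equations))  # a evaluates to 5, q to 2*5 + 10 = 20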
Without knowing more, I would suggest taking a SIMD-style approach if possible. That is, create threads to compute all formulas for a single data set. Trying to split the computation of the formulas themselves in order to parallelise them wouldn't yield much speed improvement: the logic required to split the computations into discrete units suitable for threading would be hard to write and harder to get right, and the overhead would cancel out any speed gains. It would also suffer quickly from diminishing returns.
Now, if you've got a set of formulas that are applied to many sets of data, then the parallelisation becomes easier and would scale better. Each thread does all computations for one set of data. Create one thread per CPU core and set its affinity to each core. Each thread instantiates one instance of the formula evaluation code. Create a supervisor which loads a single data set and passes it to an idle thread. If no threads are idle, wait for the first thread to finish processing its data. When all data sets are processed and all threads have finished, exit. Using this method, there's no advantage to having more threads than there are cores on the CPU, as thread switching is slow and will have a negative effect on overall speed.
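A rough Python sketch of that supervisor pattern (evaluate_formulas and the sample data sets are placeholders; a process pool with one worker per core stands in for the per-core threads):

from concurrent.futures import ProcessPoolExecutor
import os

def evaluate_formulas(data_set):
    # placeholder: run the whole formula stack on one data set
    return sum(data_set)

def process_all(data_sets):
    # one worker per CPU core; each worker handles whole data sets,
    # so no synchronisation between formulas is needed
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        return list(pool.map(evaluate_formulas, data_sets))

if __name__ == '__main__':
    print(process_all([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))  # [6, 15, 24]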
If you've only got one data set then it is not a trivial task. It would require parsing the evaluation tree for branches without dependencies on other branches and farming those branches to separate threads running on each core and waiting for the results. You then get problems synchronizing the data and ensuring data coherency.

Resources