Memory efficient parallelization of loop using MPI and mpi4py

Memory efficient parallelization of loop using MPI and mpi4py - parallel-processing

I am new to parallel programming and MPI, and I am stuck on this possibly easy problem ...
I am trying to write a code which evolve a system of N gravitationally interacting particles forward in time. This is quite easy using a naive algorithm, which is what I have done. But now I want to parallelize my code. Specifically I am writing in Python using mpi4py. A simplified (and heavily optimizable, but that is not the point), non-parallel implementation would look something like this:
# pos and vel are arrays storing the positions and velocities of all particles
dt = 0.01 # The time step
for i in range(N):
for j in range(N):
if i == j:
continue
# Calculate force between i'th and j'th particle
r = pos[j] - pos[i]
force -= r/norm(r)**3
# Update velocity
vel[i] -= dt*force
# Now use vel to update pos ...
How do I go about parallelizing this algorithm? Since the number of particles N could be very large, I want to store pos and vel only on the root process, to save memory. My initial thought was to parallelize the i-loop, but every process still needs access to pos and vel in their entirety! That is, a Scatter/Gather scheme is not helpful.
Am I forced to have a copy of pos and vel in memory for every process, or are there some way out of this?
A simple way out would be to share the memory containing pos and vel across all processes, without making duplicates. I do not know if this is possible with MPI (and specifically mpi4py).
Any help would be gratefully accepted!

I think the usual way to go about this is using domain decomposition. You divide the particles up into as many domains as you like (usually one per MPI process or one per core if you're doing multithreading). Then you use ghost regions and MPI communication to define the interactions between different domains.
Giving you a bigger answer than that is pretty involved so I'd encourage you to go check out those ideas and come back with specific problems.

Related

Optimization of time-frequency distribution - OpenCL. Sequence of small FFTs

Currently i'm trying to implement the time-frequency analysis on the ATI GPU device. Basically, what I need is to run FFTs within the small(256-ish) window moving through the initial array, and store the results of each calculation in a matrix [window_size, analysis_size].
I'm using clFFT library, with the following buffers structure:
data_buffer ~ complex<float>[analysis_size]
out_buffer ~ complex<float>[window_size * (analysis_size-window_size)]
in_buffers ~ overlapped window_size subbuffers within data_buffer, for each calculation
out_buffers ~ window_size subbuffers within out_buffer, for each calculation. No overlapping.
After the writing data(clEnqueueWriteBuffer) to the in_buffer(when all in_buffers should be filled) I simply run in a cycle the desired amount of clfftEnqueueTransforms with in_buffers[i], out_buffers[i] as parameters. After that I read everything from the out_buffer with a single clEnqueueReadBuffer.
All that bothering with subbuffers was supposed to decrease the data transfer overhead, but still...
IPP's implementation of FFT makes the same thing twice faster, which is sort of confusing. What am I doing wrong? Is there any possible optimizations, or maybe fundamental mistakes? Thanks for any help.
My code is available through that link.

Hiding communication in Matrix Vector Product with MPI

I have to solve a huge linear equation for multiple right sides (Let's say 20 to 200). The Matrix is stored in a sparse format and distributed over multiple MPI nodes (Let's say 16 to 64). I run a CG solver on the rank 0 node. It's not possible to solve the linear equation directly, because the system matrix would be dense (Sys = A^T * S * A).
The basic Matrix-Vector multiplication is implemented as:
broadcast x
y = A_part * x
reduce y
While the collective operations are reasonably fast (OpenMPI seems to use a binary tree like communication pattern + Infiniband), it still accounts for a quite large part of the runtime. For performance reasons we already calculate 8 right sides per iteration (Basicly SpM * DenseMatrix, just to be complete).
I'm trying to come up with a good scheme to hide the communication latency, but I did not have a good idea yet. I also try to refrain from doing 1:n communication, although I did not yet measure if scaling would be a problem.
Any suggestions are welcome!

If your matrix is already distributed, would it be possible to use a distributed sparse linear solver instead of running it only on rank 0 and then broadcasting the result (if I'm reading your description correctly..). There's plenty of libraries for that, e.g. SuperLU_DIST, MUMPS, PARDISO, Aztec(OO), etc.
The "multiple rhs" optimization is supported by at least SuperLU and MUMPS (haven't checked the others, but I'd be VERY surprised if they didn't support it!), since they solve AX=B where X and B are matrices with potentially > 1 column. That is, each "rhs" is stored as a column vector in B.

If you don't need to have the results of an old right-hand-side before starting the next run you could try to use non-blocking communication (ISend, IRecv) and communicate the result while calculating the next right-hand-side already.
But make sure you call MPI_Wait before reading the content of the communicated array, in order to be sure you're not reading "old" data.
If the matrices are big enough (i.e. it takes long enough to calculate the matrix-product) you don't have any communication delay at all with this approach.

Why is MATLAB so fast in matrix multiplication?

I am making some benchmarks with CUDA, C++, C#, Java, and using MATLAB for verification and matrix generation. When I perform matrix multiplication with MATLAB, 2048x2048 and even bigger matrices are almost instantly multiplied.
1024x1024 2048x2048 4096x4096
--------- --------- ---------
CUDA C (ms) 43.11 391.05 3407.99
C++ (ms) 6137.10 64369.29 551390.93
C# (ms) 10509.00 300684.00 2527250.00
Java (ms) 9149.90 92562.28 838357.94
MATLAB (ms) 75.01 423.10 3133.90
Only CUDA is competitive, but I thought that at least C++ will be somewhat close and not 60 times slower. I also don't know what to think about the C# results. The algorithm is just the same as C++ and Java, but there's a giant jump 2048 from 1024.
How is MATLAB performing matrix multiplication so fast?
C++ Code:
float temp = 0;
timer.start();
for(int j = 0; j < rozmer; j++)
{
for (int k = 0; k < rozmer; k++)
{
temp = 0;
for (int m = 0; m < rozmer; m++)
{
temp = temp + matice1[j][m] * matice2[m][k];
}
matice3[j][k] = temp;
}
}
timer.stop();

This kind of question is recurring and should be answered more clearly than "MATLAB uses highly optimized libraries" or "MATLAB uses the MKL" for once on Stack Overflow.
History:
Matrix multiplication (together with Matrix-vector, vector-vector multiplication and many of the matrix decompositions) is (are) the most important problems in linear algebra. Engineers have been solving these problems with computers since the early days.
I'm not an expert on the history, but apparently back then, everybody just rewrote his FORTRAN version with simple loops. Some standardization then came along, with the identification of "kernels" (basic routines) that most linear algebra problems needed in order to be solved. These basic operations were then standardized in a specification called: Basic Linear Algebra Subprograms (BLAS). Engineers could then call these standard, well-tested BLAS routines in their code, making their work much easier.
BLAS:
BLAS evolved from level 1 (the first version which defined scalar-vector and vector-vector operations) to level 2 (vector-matrix operations) to level 3 (matrix-matrix operations), and provided more and more "kernels" so standardized more and more of the fundamental linear algebra operations. The original FORTRAN 77 implementations are still available on Netlib's website.
Towards better performance:
So over the years (notably between the BLAS level 1 and level 2 releases: early 80s), hardware changed, with the advent of vector operations and cache hierarchies. These evolutions made it possible to increase the performance of the BLAS subroutines substantially. Different vendors then came along with their implementation of BLAS routines which were more and more efficient.
I don't know all the historical implementations (I was not born or a kid back then), but two of the most notable ones came out in the early 2000s: the Intel MKL and GotoBLAS. Your Matlab uses the Intel MKL, which is a very good, optimized BLAS, and that explains the great performance you see.
Technical details on Matrix multiplication:
So why is Matlab (the MKL) so fast at dgemm (double-precision general matrix-matrix multiplication)? In simple terms: because it uses vectorization and good caching of data. In more complex terms: see the article provided by Jonathan Moore.
Basically, when you perform your multiplication in the C++ code you provided, you are not at all cache-friendly. Since I suspect you created an array of pointers to row arrays, your accesses in your inner loop to the k-th column of "matice2": matice2[m][k] are very slow. Indeed, when you access matice2[0][k], you must get the k-th element of the array 0 of your matrix. Then in the next iteration, you must access matice2[1][k], which is the k-th element of another array (the array 1). Then in the next iteration you access yet another array, and so on... Since the entire matrix matice2 can't fit in the highest caches (it's 8*1024*1024 bytes large), the program must fetch the desired element from main memory, losing a lot of time.
If you just transposed the matrix, so that accesses would be in contiguous memory addresses, your code would already run much faster because now the compiler can load entire rows in the cache at the same time. Just try this modified version:
timer.start();
float temp = 0;
//transpose matice2
for (int p = 0; p < rozmer; p++)
{
for (int q = 0; q < rozmer; q++)
{
tempmat[p][q] = matice2[q][p];
}
}
for(int j = 0; j < rozmer; j++)
{
for (int k = 0; k < rozmer; k++)
{
temp = 0;
for (int m = 0; m < rozmer; m++)
{
temp = temp + matice1[j][m] * tempmat[k][m];
}
matice3[j][k] = temp;
}
}
timer.stop();
So you can see how just cache locality increased your code's performance quite substantially. Now real dgemm implementations exploit that to a very extensive level: They perform the multiplication on blocks of the matrix defined by the size of the TLB (Translation lookaside buffer, long story short: what can effectively be cached), so that they stream to the processor exactly the amount of data it can process. The other aspect is vectorization, they use the processor's vectorized instructions for optimal instruction throughput, which you can't really do from your cross-platform C++ code.
Finally, people claiming that it's because of Strassen's or Coppersmith–Winograd algorithm are wrong, both these algorithms are not implementable in practice, because of hardware considerations mentioned above.

Here's my results using MATLAB R2011a + Parallel Computing Toolbox on a machine with a Tesla C2070:
>> A = rand(1024); gA = gpuArray(A);
% warm up by executing the operations a couple of times, and then:
>> tic, C = A * A; toc
Elapsed time is 0.075396 seconds.
>> tic, gC = gA * gA; toc
Elapsed time is 0.008621 seconds.
MATLAB uses highly optimized libraries for matrix multiplication which is why the plain MATLAB matrix multiplication is so fast. The gpuArray version uses MAGMA.
Update using R2014a on a machine with a Tesla K20c, and the new timeit and gputimeit functions:
>> A = rand(1024); gA = gpuArray(A);
>> timeit(#()A*A)
ans =
0.0324
>> gputimeit(#()gA*gA)
ans =
0.0022
Update using R2018b on a WIN64 machine with 16 physical cores and a Tesla V100:
>> timeit(#()A*A)
ans =
0.0229
>> gputimeit(#()gA*gA)
ans =
4.8019e-04
(NB: at some point (I forget when exactly) gpuArray switched from MAGMA to cuBLAS - MAGMA is still used for some gpuArray operations though)
Update using R2022a on a WIN64 machine with 32 physical cores and an A100 GPU:
>> timeit(#()A*A)
ans =
0.0076
>> gputimeit(#()gA*gA)
ans =
2.5344e-04

This is why. MATLAB doesn't perform a naive matrix multiplication by looping over every single element the way you did in your C++ code.
Of course I'm assuming that you just used C=A*B instead of writing a multiplication function yourself.

Matlab incorporated LAPACK some time ago, so I assume their matrix multiplication uses something at least that fast. LAPACK source code and documentation is readily available.
You might also look at Goto and Van De Geijn's paper "Anatomy of High-Performance Matrix
Multiplication" at http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.140.1785&rep=rep1&type=pdf

The answer is LAPACK and BLAS libraries make MATLAB blindingly fast at matrix operations, not any proprietary code by the folks at MATLAB.
Use the LAPACK and/or BLAS libraries in your C++ code for matrix operations and you should get similar performance as MATLAB. These libraries should be freely available on any modern system and parts were developed over decades in academia. Note that there are multiple implementations, including some closed source such as Intel MKL.
A discussion of how BLAS gets high performance is available here.
BTW, it's a serious pain in my experience to call LAPACK libraries directly from c (but worth it). You need to read the documentation VERY precisely.

When doing matrix multiplying, you use naive multiplication method which takes time of O(n^3).
There exist matrix multiplication algorithm which takes O(n^2.4). Which means that at n=2000 your algorithm requires ~100 times as much computation as the best algorithm.
You should really check the wikipedia page for matrix multiplication for further information on the efficient ways to implement it.

Depending on your version of Matlab, I believe it might be using your GPU already.
Another thing; Matlab keeps track of many properties of your matrix; wether its diagonal, hermetian, and so forth, and specializes its algorithms based thereon. Maybe its specializing based on the zero matrix you are passing it, or something like that? Maybe it is caching repeated function calls, which messes up your timings? Perhaps it optimizes out repeated unused matrix products?
To guard against such things happening, use a matrix of random numbers, and make sure you force execution by printing the result to screen or disk or somesuch.

The general answer to "Why is matlab faster at doing xxx than other programs" is that matlab has a lot of built in, optimized functions.
The other programs that are used often do not have these functions so people apply their own creative solutions, which are suprisingly slower than professionally optimized code.
This can be interpreted in two ways:
1) The common/theoretical way: Matlab is not significantly faster, you are just doing the benchmark wrong
2) The realistic way: For this stuff Matlab is faster in practice because languages as c++ are just too easily used in ineffective ways.

MATLAB uses a highly optimized implementation of LAPACK from Intel known as Intel Math Kernel Library (Intel MKL) - specifically the dgemm function. The speed This library takes advantage of processor features including SIMD instructions and multi-core processors. They don't document which specific algorithm they use. If you were to call Intel MKL from C++ you should see similar performance.
I am not sure what library MATLAB uses for GPU multiplication but probably something like nVidia CUBLAS.

The sharp contrast is not only due to Matlab's amazing optimization (as discussed by many other answers already), but also in the way you formulated matrix as an object.
It seems like you made matrix a list of lists? A list of lists contains pointers to lists which then contain your matrix elements. The locations of the contained lists are assigned arbitrarily. As you are looping over your first index (row number?), the time of memory access is very significant. In comparison, why don't you try implement matrix as a single list/vector using the following method?
#include <vector>
struct matrix {
matrix(int x, int y) : n_row(x), n_col(y), M(x * y) {}
int n_row;
int n_col;
std::vector<double> M;
double &operator()(int i, int j);
};
And
double &matrix::operator()(int i, int j) {
return M[n_col * i + j];
}
The same multiplication algorithm should be used so that the number of flop is the same. (n^3 for square matrices of size n)
I'm asking you to time it so that the result is comparable to what you had earlier (on the same machine). With the comparison, you will show exactly how significant memory access time can be!

It's slow in C++ because you are not using multithreading. Essentially, if A = B C, where they are all matrices, the first row of A can be computed independently from the 2nd row, etc. If A, B, and C are all n by n matrices, you can speed up the multiplication by a factor of n^2, as
a_{i,j} = sum_{k} b_{i,k} c_{k,j}
If you use, say, Eigen [ http://eigen.tuxfamily.org/dox/GettingStarted.html ], multithreading is built-in and the number of threads is adjustable.

Because MATLAB is a programming language at first developed for numerical linear algebra (matrix manipulations), which has libraries especially developed for matrix multiplications. And now MATLAB can also use the GPUs (Graphics processing unit) for this additionally.
And if we look at your computation results:
1024x1024 2048x2048 4096x4096
--------- --------- ---------
CUDA C (ms) 43.11 391.05 3407.99
C++ (ms) 6137.10 64369.29 551390.93
C# (ms) 10509.00 300684.00 2527250.00
Java (ms) 9149.90 92562.28 838357.94
MATLAB (ms) 75.01 423.10 3133.90
then we can see that not only MATLAB is so fast in matrix multiplication: CUDA C (programming language from NVIDIA) has some better results than MATLAB. CUDA C has also libraries especially developed for matrix multiplications and it uses the GPUs.
Short history of MATLAB
Cleve Moler, the chairman of the computer science department at the University of New Mexico, started developing MATLAB in the late 1970s. He designed it to give his students access to LINPACK (a software library for performing numerical linear algebra) and EISPACK (is a software library for numerical computation of linear algebra) without them having to learn Fortran. It soon spread to other universities and found a strong audience within the applied mathematics community. Jack Little, an engineer, was exposed to it during a visit Moler made to Stanford University in 1983. Recognizing its commercial potential, he joined with Moler and Steve Bangert. They rewrote MATLAB in C and founded MathWorks in 1984 to continue its development. These rewritten libraries were known as JACKPAC. In 2000, MATLAB was rewritten to use a newer set of libraries for matrix manipulation, LAPACK (is a standard software library for numerical linear algebra).
Source
What is CUDA C
CUDA C uses also libraries especially developed for matrix multiplications like OpenGL (Open Graphics Library). It uses also GPU and Direct3D (on MS Windows).
The CUDA platform is designed to work with programming languages such as C, C++, and Fortran. This accessibility makes it easier for specialists in parallel programming to use GPU resources, in contrast to prior APIs like Direct3D and OpenGL, which required advanced skills in graphics programming. Also, CUDA supports programming frameworks such as OpenACC and OpenCL.
Example of CUDA processing flow:
Copy data from main memory to GPU memory
CPU initiates the GPU compute kernel
GPU's CUDA cores execute the kernel in parallel
Copy the resulting data from GPU memory to main memory
Comparing CPU and GPU Execution Speeds
We ran a benchmark in which we measured the amount of time it took to execute 50 time steps for grid sizes of 64, 128, 512, 1024, and 2048 on an Intel Xeon Processor X5650 and then using an NVIDIA Tesla C2050 GPU.
For a grid size of 2048, the algorithm shows a 7.5x decrease in compute time from more than a minute on the CPU to less than 10 seconds on the GPU. The log scale plot shows that the CPU is actually faster for small grid sizes. As the technology evolves and matures, however, GPU solutions are increasingly able to handle smaller problems, a trend that we expect to continue.
Source
From introduction for CUDA C Programming Guide:
Driven by the insatiable market demand for realtime, high-definition 3D graphics, the programmable Graphic Processor Unit or GPU has evolved into a highly parallel, multithreaded, manycore processor with tremendous computational horsepower and very high memory bandwidth, as illustrated by Figure 1 and Figure 2.
Figure 1. Floating-Point Operations per Second for the CPU and GPU
Figure 2. Memory Bandwidth for the CPU and GPU
The reason behind the discrepancy in floating-point capability between the CPU and the GPU is that the GPU is specialized for compute-intensive, highly parallel computation - exactly what graphics rendering is about - and therefore designed such that more transistors are devoted to data processing rather than data caching and flow control, as schematically illustrated by Figure 3.
Figure 3. The GPU Devotes More Transistors to Data Processing
More specifically, the GPU is especially well-suited to address problems that can be expressed as data-parallel computations - the same program is executed on many data elements in parallel - with high arithmetic intensity - the ratio of arithmetic operations to memory operations. Because the same program is executed for each data element, there is a lower requirement for sophisticated flow control, and because it is executed on many data elements and has high arithmetic intensity, the memory access latency can be hidden with calculations instead of big data caches.
Data-parallel processing maps data elements to parallel processing threads. Many applications that process large data sets can use a data-parallel programming model to speed up the computations. In 3D rendering, large sets of pixels and vertices are mapped to parallel threads. Similarly, image and media processing applications such as post-processing of rendered images, video encoding and decoding, image scaling, stereo vision, and pattern recognition can map image blocks and pixels to parallel processing threads. In fact, many algorithms outside the field of image rendering and processing are accelerated by data-parallel processing, from general signal processing or physics simulation to computational finance or computational biology.
Source
Advanced reading
GPUs (Graphics processing unit)
MATLAB
CUDA C Programming Guide
Using GPUs in MATLAB
Basic Linear Algebra Subprograms (BLAS)
Anatomy of High-Performance Matrix Multiplication, from Kazushige Goto and Robert A. Van De Geijn
Some interesting facs
I've written C++ matrix multiplication that is as fast as Matlab's but it took some care. (Before Matlab was using GPUs for this).
Сitation from this answer.

Time delay estimation between two audio signals

I have two audio recordings of a same signal by 2 different microphones (for example, in a WAV format), but one of them is recorded with delay, for example, several seconds.
It's easy to identify such a delay visually when viewing these signals in some kind of waveform viewer - i.e. just spotting first visible peak in every signal and ensuring that they're the same shape:
(source: greycat.ru)
But how do I do it programmatically - find out what this delay (t) is? Two digitized signals are slightly different (because microphones are different, were at different positions, due to ADC setups, etc).
I've digged around a bit and found out that this problem is usually called "time-delay estimation" and it has myriads of approaches to it - for example, one of them.
But are there any simple and ready-made solutions, such as command-line utility, library or straight-forward algorithm available?
Conclusion: I've found no simple implementation and done a simple command-line utility myself - available at https://bitbucket.org/GreyCat/calc-sound-delay (GPLv3-licensed). It implements a very simple search-for-maximum algorithm described at Wikipedia.

The technique you're looking for is called cross correlation. It's a very simple, if somewhat compute intensive technique which can be used for solving various problems, including measuring the time difference (aka lag) between two similar signals (the signals do not need to be identical).
If you have a reasonable idea of your lag value (or at least the range of lag values that are expected) then you can reduce the total amount of computation considerably. Ditto if you can put a definite limit on how much accuracy you need.

Having had the same problem and without success to find a tool to sync the start of video/audio recordings automatically,
I decided to make syncstart (github).
It is a command line tool. The basic code behind it is this:
import numpy as np
from scipy import fft
from scipy.io import wavfile
r1,s1 = wavfile.read(in1)
r2,s2 = wavfile.read(in2)
assert r1==r2, "syncstart normalizes using ffmpeg"
fs = r1
ls1 = len(s1)
ls2 = len(s2)
padsize = ls1+ls2+1
padsize = 2**(int(np.log(padsize)/np.log(2))+1)
s1pad = np.zeros(padsize)
s1pad[:ls1] = s1
s2pad = np.zeros(padsize)
s2pad[:ls2] = s2
corr = fft.ifft(fft.fft(s1pad)*np.conj(fft.fft(s2pad)))
ca = np.absolute(corr)
xmax = np.argmax(ca)
if xmax > padsize // 2:
file,offset = in2,(padsize-xmax)/fs
else:
file,offset = in1,xmax/fs

A very straight forward thing todo is just to check if the peaks exceed some threshold, the time between high-peak on line A and high-peak on line B is probably your delay. Just try tinkering a bit with the thresholds and if the graphs are usually as clear as the picture you posted, then you should be fine.

Finding the optimum file size combination

This is a problem I would think there is an algorithm for already - but I do not know the right words to use with google it seems :).
The problem: I would like to make a little program with which I would select a directory containing any files (but for my purpose media files, audio and video). After that I would like to enter in MB the maximum total file size sum that must not be exceeded. At this point you would hit a "Calculate best fit" button.
This button should compare all the files in the directory and provide as a result a list of the files that when put together gets most close to the max total file size without going over the limit.
This way you could find out which files to combine when burning a CD or DVD so that you will be able to use as much as possible of the disc.
I've tried to come up with the algorithm for this myself - but failed :(.
Anyone know of some nice algorithm for doing this?
Thanks in advance :)

Just for fun I tried out the accurate dynamic programming solution. Written in Python, because of my supreme confidence that you shouldn't optimise until you have to ;-)
This could provide either a start, or else a rough idea of how close you can get before resorting to approximation.
Code based on http://en.wikipedia.org/wiki/Knapsack_problem#0-1_knapsack_problem, hence the less-than-informative variable names m, W, w, v.
#!/usr/bin/python
import sys
solcount = 0
class Solution(object):
def __init__(self, items):
object.__init__(self)
#self.items = items
self.value = sum(items)
global solcount
solcount += 1
def __str__(self):
#return str(self.items) + ' = ' + str(self.value)
return ' = ' + str(self.value)
m = {}
def compute(v, w):
coord = (len(v),w)
if coord in m:
return m[coord]
if len(v) == 0 or w == 0:
m[coord] = Solution([])
return m[coord]
newvalue = v[0]
newarray = v[1:]
notused = compute(newarray, w)
if newvalue > w:
m[coord] = notused
return notused
# used = Solution(compute(newarray, w - newvalue).items + [newvalue])
used = Solution([compute(newarray, w - newvalue).value] + [newvalue])
best = notused if notused.value >= used.value else used
m[coord] = best
return best
def main():
v = [int(l) for l in open('filesizes.txt')]
W = int(sys.argv[1])
print len(v), "items, limit is", W
print compute(v, W)
print solcount, "solutions computed"
if __name__ == '__main__':
main()
For simplicity I'm just considering the file sizes: once you have the list of sizes that you want to use, you can find some filenames with those sizes by searching through a list, so there's no point tangling up filenames in the core, slow part of the program. I'm also expressing everything in multiples of the block size.
As you can see, I've commented out the code that gives the actual solution (as opposed to the value of the solution). That was to save memory - the proper way to store the list of files used isn't one list in each Solution, it's to have each solution point back to the Solution it was derived from. You can then calculate the list of filesizes at the end by going back through the chain, outputting the difference between the values at each step.
With a list of 100 randomly-generated file sizes in the range 2000-6000 (I'm assuming 2k blocks, so that's files of size 4-12MB), this solves for W=40K in 100 seconds on my laptop. In doing so it computes 2.6M of a possible 4M solutions.
Complexity is O(W*n), where n is the number of files. This does not contradict the fact that the problem is NP-complete. So I am at least approaching a solution, and this is just in unoptimised Python.
Clearly some optimisation is now required, because actually it needs to be solved for W=4M (8GB DVD) and however many files you have (lets say a few thousand). Presuming that the program is allowed to take 15 minutes (comparable to the time required to write a DVD), that means performance is currently short by a factor of roughly 10^3. So we have a problem that's quite hard to solve quickly and accurately on a PC, but not beyond the bounds of technology.
Memory use is the main concern, since once we start hitting swap we'll slow down, and if we run out of virtual address space we're in real trouble because we have to implement our own storage of Solutions on disk. My test run peaks at 600MB. If you wrote the code in C on a 32-bit machine, each "solution" has a fixed size of 8 bytes. You could therefore generate a massive 2-D array of them without doing any memory allocation in the loop, but in 2GB of RAM you could only handle W=4M and n=67. Oops - DVDs are out. It could very nearly solve for 2-k blocksize CDs, though: W=350k gives n=766.
Edit: MAK's suggestion to compute iteratively bottom-up, rather than recursively top-down, should massively reduce the memory requirement. First calculate m(1,w) for all 0 <= w <= W. From this array, you can calculate m(2,w) for all 0 <= w <= W. Then you can throw away all the m(1,w) values: you won't need them to calculate m(3,w) etc.
By the way, I suspect that actually the problem you want to solve might be the bin packing problem, rather than just the question of how to get the closest possible to filling a DVD. That's if you have a bunch of files, you want to write them all to DVD, using as few DVDs as possible. There are situations where solving the bin packing problem is very easy, but solving this problem is hard. For example, suppose that you have 8GB disks, and 15GB of small files. It's going to take some searching to find the closest possible match to 8GB, but the bin-packing problem would be trivially solved just by putting roughly half the files on each disk - it doesn't matter exactly how you divide them because you're going to waste 1GB of space whatever you do.
All that said, there are extremely fast heuristics that give decent results much of the time. Simplest is to go through the list of files (perhaps in decreasing order of size), and include each file if it fits, exclude it otherwise. You only need to fall back to anything slow if fast approximate solutions aren't "good enough", for your choice of "enough".

This is, as other pointed out, the Knapsack Problem, which is a combinatorial optimization problem. It means that you look for some subset or permutation of a set which minimizes (or maximizes) a certain cost. Another well known such problem is the Traveling Salesman Problem.
Such problems are usually very hard to solve. But if you're interested in almost optimal solutions, you can use non-deterministic algorithms, like simulated annealing. You most likely won't get the optimal solution, but a nearly optimal one.
This link explains how simulated annealing can solve the Knapsack Problem, and therefore should be interesting to you.

Sounds like you have a hard problem there. This problem is well known, but no efficient solutions (can?) exist.

Other then the obvious way of trying all permuations of objects with size < bucket, you could also have a look at the implementation of the bucketizer perl module, which does exactly what you are asking for. I'm not sure what it does exactly, but the manual mentions that there is one "brute force" way, so I'm assuming there must also be some kind of optimization.

Thank you for your answers.
I looked into this problem more now with the guidance of the given answers. Among other things I found this webpage, http://www.mathmaniacs.org/lessons/C-subsetsum/index.html. It tells about the subset sum problem, which I believe is the problem I described here.
One sentence from the webpage is this:
--
You may want to point out that a number like 2300 is so large that even a computer counting at a speed of over a million or billion each second, would not reach 2300 until long after our sun had burned out.
--
Personally I would have more use for this algorithm when comparing a larger amount of file sizes than let's say 10 or less as it is somehow easy to reach the probably biggest sum just by trial and error manually if the number of files is low.
A CD with mp3:s can easily have 100 mp3s and a DVD a lot more, which leads to the sun burning out before I have the answer :).
Randomly trying to find the optimum sum can apparently get you pretty close but it can never be guaranteed to be the optimum answer and can also with bad luck be quite far away. Brute-force is the only real way it seems to get the optimum answer and that would take way too long.
So I guess I just continue estimating manually a good combination of files to burn on CDs and DVDs. :)
Thanks for the help. :)

If you're looking for a reasonable heuristic, and the objective is to minimize the number of disks required, here's a simple one you might consider. It's similar to one I used recently for a job-shop problem. I was able to compare it to known optima, and found it provided allocations that were either optimal or extremely close to being optimal.
Suppose B is the size of all files combined and C is the capacity of each disk. Then you will need at least n = roundup(B/C) disks. Try to fit all the files on n disks. If you are able to do so, you're finished, and have an optimal solution. Otherwise, try to fit all the files on n+1 disks. If you are able to do so, you have a heuristic solution; otherwise try to fit the files on n+2 disks, and so on, until you are able to do so.
For any given allocation of files to disks below (which may exceed some disk capacities), let si be the combined size of files allocated to disk i, and t = max si. We are finished when t <=C.
First, order (and index) the files largest to smallest.
For m >= n disks,
Allocate the files to the disks in a back-in-forth way: 1->1, 2->2, ... m->m, m+1>m-1, m+2->m-2, ... 2m->1, 2m+1->2, 2m+2->3 ... 3m->m, 3m+1->m-1, and so on until all files are allocated, with no regard to disk capacity. If t <= C we are finished (and the allocation is optimal if m = n); else go to #2.
Attempt to reduce t by moving a file from a disk i with si = t to another disk, without increasing t. Continue doing this until t <= C, in which case we are finished (and the allocation is optimal if m = n), or t cannot be reduced further, in which case go to #3.
Attempt to reduce t by performing pairwise exchanges between disks. Continue doing this until t <= C, in which case we are finished (and the allocation is optimal if m = n), or t cannot be reduced further with pairwise exchanges. In the latter case, repeat #2, unless no improvement was made the last time #2 was repeated, in which case increment m by one and repeat #1.
In #2 and #3 there are of course different ways to order possible reallocations and pairwise exchanges.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio