In the following question, we're talking about an algorithm which transposes a matrix of complex values, struct complex {double real = 0.0; double imag = 0.0;};. Owing to a special data layout, there is a stride of n*n elements between the rows, which means that loading a subsequent row causes the eviction of the previously loaded row. All runs have been done using 1 thread only.
I'm trying to understand why my 'optimized' transpose function, which makes use of 2D blocking, is performing badly (follow-up to: 2D blocking with unique matrix transpose problem), so I'm using performance counters and cache simulators to get a reading on what's going wrong.
According to my analysis, if n=500 is the size of the matrix, b=4 is my block size and c=4 is my cache-line size in elements (a 64-byte line holds four 16-byte complex values), we have for the naive algorithm:
for (auto i1 = std::size_t{}; i1 < n1; ++i1)
{
    for (auto i3 = std::size_t{}; i3 < n3; ++i3)
    {
        mat_out(i3, i1) = mat_in(i1, i3);
    }
}
Number of cache-references: (read) n*n + (write) n*n
Number of cache-misses: (read) n*n / c + (write) n*n
Rate of misses: 62.5 %.
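Spelled out, that rate is just the predicted misses divided by the predicted references:

(n^2/c + n^2) / (n^2 + n^2) = (1/4 + 1) / 2 = 0.625 = 62.5 %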
Sure enough, cachegrind reports essentially the same rate:
==21470== Cachegrind, a cache and branch-prediction profiler
==21470== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==21470== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==21470== Command: ./benchmark/benchmarking_transpose_vslices_dir2_naive 500
==21470==
--21470-- warning: L3 cache found, using its data for the LL simulation.
--21470-- warning: specified LL cache: line_size 64 assoc 12 total_size 9,437,184
--21470-- warning: simulated LL cache: line_size 64 assoc 18 total_size 9,437,184
==21470==
==21470== I refs: 30,130,879,636
==21470== I1 misses: 7,666
==21470== LLi misses: 6,286
==21470== I1 miss rate: 0.00%
==21470== LLi miss rate: 0.00%
==21470==
==21470== D refs: 13,285,386,487 (6,705,198,115 rd + 6,580,188,372 wr)
==21470== D1 misses: 8,177,337,186 (1,626,402,679 rd + 6,550,934,507 wr)
==21470== LLd misses: 3,301,064,720 (1,625,156,375 rd + 1,675,908,345 wr)
==21470== D1 miss rate: 61.6% ( 24.3% + 99.6% )
==21470== LLd miss rate: 24.8% ( 24.2% + 25.5% )
==21470==
==21470== LL refs: 8,177,344,852 (1,626,410,345 rd + 6,550,934,507 wr)
==21470== LL misses: 3,301,071,006 (1,625,162,661 rd + 1,675,908,345 wr)
==21470== LL miss rate: 7.6% ( 4.4% + 25.5% )
Now, for the implementation with blocking, I expect the following (the predicted counts are listed after the code).
Note: the code below omits the remainder loops. The container intermediate_result, sized b x b, is used, as per a suggestion by @JérômeRichard, to prevent cache-thrashing.
for (auto bi1 = std::size_t{}; bi1 < n1; bi1 += block_size)
{
    for (auto bi3 = std::size_t{}; bi3 < n3; bi3 += block_size)
    {
        for (auto i1 = std::size_t{}; i1 < block_size; ++i1)
        {
            for (auto i3 = std::size_t{}; i3 < block_size; ++i3)
            {
                intermediate_result(i3, i1) = mat_in(bi1 + i1, bi3 + i3);
            }
        }
        for (auto i1 = std::size_t{}; i1 < block_size; ++i1)
        {
            #pragma omp simd safelen(8)
            for (auto i3 = std::size_t{}; i3 < block_size; ++i3)
            {
                mat_out(bi3 + i1, bi1 + i3) = intermediate_result(i1, i3);
            }
        }
    }
}
Number of cache-references: (read) b*b + (write) b*b
Number of cache-misses: (read) b*b / c + (write) b*b / c
Rate of misses: 25 %.
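Again, spelled out per b x b block:

(b^2/c + b^2/c) / (b^2 + b^2) = 1/c = 1/4 = 25 %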
Once again, cachegrind gives me the following report:
==21473== Cachegrind, a cache and branch-prediction profiler
==21473== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==21473== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==21473== Command: ./benchmark/benchmarking_transpose_vslices_dir2_best 500 4
==21473==
--21473-- warning: L3 cache found, using its data for the LL simulation.
--21473-- warning: specified LL cache: line_size 64 assoc 12 total_size 9,437,184
--21473-- warning: simulated LL cache: line_size 64 assoc 18 total_size 9,437,184
==21473==
==21473== I refs: 157,135,137,350
==21473== I1 misses: 11,057
==21473== LLi misses: 9,604
==21473== I1 miss rate: 0.00%
==21473== LLi miss rate: 0.00%
==21473==
==21473== D refs: 43,995,141,079 (29,709,076,051 rd + 14,286,065,028 wr)
==21473== D1 misses: 3,307,834,114 ( 1,631,898,173 rd + 1,675,935,941 wr)
==21473== LLd misses: 3,301,066,570 ( 1,625,157,620 rd + 1,675,908,950 wr)
==21473== D1 miss rate: 7.5% ( 5.5% + 11.7% )
==21473== LLd miss rate: 7.5% ( 5.5% + 11.7% )
==21473==
==21473== LL refs: 3,307,845,171 ( 1,631,909,230 rd + 1,675,935,941 wr)
==21473== LL misses: 3,301,076,174 ( 1,625,167,224 rd + 1,675,908,950 wr)
==21473== LL miss rate: 1.6% ( 0.9% + 11.7% )
I cannot explain this discrepancy at this point (the measured D1 miss rate of 7.5 % is well below the predicted 25 %), except to speculate that it might be due to hardware prefetching.
Now, when I profile the same naive implementation using perf (with the "-d" option), I get:
Performance counter stats for './benchmark/benchmarking_transpose_vslices_dir2_naive 500':
91.122,33 msec task-clock # 0,933 CPUs utilized
870.939 context-switches # 0,010 M/sec
17 cpu-migrations # 0,000 K/sec
50.807.083 page-faults # 0,558 M/sec
354.169.268.894 cycles # 3,887 GHz
217.031.159.494 instructions # 0,61 insn per cycle
34.980.334.095 branches # 383,883 M/sec
148.578.378 branch-misses # 0,42% of all branches
58.473.530.591 L1-dcache-loads # 641,704 M/sec
12.636.479.302 L1-dcache-load-misses # 21,61% of all L1-dcache hits
440.543.654 LLC-loads # 4,835 M/sec
276.733.102 LLC-load-misses # 62,82% of all LL-cache hits
97,705649040 seconds time elapsed
45,526653000 seconds user
47,295247000 seconds sys
When I do the same for the implementation with 2D-blocking, I get:
Performance counter stats for './benchmark/benchmarking_transpose_vslices_dir2_best 500 4':
79.865,16 msec task-clock # 0,932 CPUs utilized
766.200 context-switches # 0,010 M/sec
12 cpu-migrations # 0,000 K/sec
50.807.088 page-faults # 0,636 M/sec
310.452.015.452 cycles # 3,887 GHz
343.399.743.845 instructions # 1,11 insn per cycle
51.889.725.247 branches # 649,717 M/sec
133.541.902 branch-misses # 0,26% of all branches
81.279.037.114 L1-dcache-loads # 1017,703 M/sec
7.722.318.725 L1-dcache-load-misses # 9,50% of all L1-dcache hits
399.149.174 LLC-loads # 4,998 M/sec
123.134.807 LLC-load-misses # 30,85% of all LL-cache hits
85,660207381 seconds time elapsed
34,524170000 seconds user
46,884443000 seconds sys
Questions:
Why is there such a strong difference in the output here for L1D and LLC?
Why are we seeing such a bad L3 cache-miss rate (according to perf) in the case of the blocking algorithm? This is obviously exacerbated when I start using 6 cores.
Any tips on how to detect cache-thrashing would also be appreciated (a rough analytic sketch follows the hardware details below).
Thanks in advance for your time and help; I'm glad to provide additional information upon request.
Additional Info:
The processor used for testing here is the (Coffee Lake) Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz.
CPU with 6 cores operating at 2.80 GHz - 4.00 GHz
L1 6x 32 KiB 8-way set associative (64 sets)
L2 6x 256 KiB 4-way set associative (1024 sets)
shared L3 9 MiB 12-way set associative (12288 sets)
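Regarding the third question (detecting cache-thrashing): one rough analytic check, given the L1 parameters above (32 KiB, 8-way, 64-byte lines, hence 64 sets), is to map the start address of each row to its L1 set index and see how often successive rows collide. The sketch below is only illustrative; the row stride of n*n elements of a 16-byte struct complex is taken from the description at the top, and the constants are assumptions to be adjusted to the real layout.

#include <cstddef>
#include <cstdio>

int main()
{
    // L1d parameters from the spec above: 32 KiB, 8-way, 64-byte lines -> 64 sets.
    const std::size_t line_size = 64;
    const std::size_t num_sets  = 64;

    // Assumed layout from the question: rows are n*n elements apart,
    // with each element a 16-byte struct complex (two doubles).
    const std::size_t n            = 500;
    const std::size_t element_size = 16;
    const std::size_t row_stride   = n * n * element_size;   // bytes between row starts

    // Print the L1 set index of the first element of each of the first 16 rows.
    // Rows whose start addresses land in the same set (beyond the 8-way
    // associativity) are candidates for thrashing each other.
    for (std::size_t row = 0; row < 16; ++row)
    {
        const std::size_t byte_offset = row * row_stride;
        const std::size_t set_index   = (byte_offset / line_size) % num_sets;
        std::printf("row %2zu -> L1 set %2zu\n", row, set_index);
    }
    return 0;
}

If many of the rows touched inside one block map to the same handful of sets, the 8-way associativity can be exceeded and lines evict each other even though the total footprint is small.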
I am trying to do live audio processing using the Intel HD Graphics GPU. In theory it should be perfect for this, but I am surprised at the cost of the enqueue commands. This looks to be a prohibitive factor, and by far the most time-consuming step.
In short, calling the enqueueXXXXX commands takes a long time, while actually copying the data and executing the kernel is sufficiently fast. Is this just an inherent problem with the OpenCL implementation, or am I doing something wrong?
Data copying + kernel execution takes about 10us
Calling the enqueue commands takes about 300us - 500us
The code is available at https://github.com/tblum/opencl_enqueue/blob/master/main.cpp
for (int i = 0; i < 10; ++i) {
    cl::Event copyToEvent;
    cl::Event copyFromEvent;
    cl::Event kernelEvent;
    auto t1 = Clock::now();
    commandQueue.enqueueWriteBuffer(clIn, CL_FALSE, 0, 10 * 48 * sizeof(float), frameBufferIn, nullptr, &copyToEvent);
    OCLdownMix.setArg(0, clIn);
    OCLdownMix.setArg(1, clOut);
    OCLdownMix.setArg(2, (unsigned int)480);
    commandQueue.enqueueNDRangeKernel(OCLdownMix, cl::NullRange, cl::NDRange(480), cl::NDRange(48), nullptr, &kernelEvent);
    commandQueue.enqueueReadBuffer(clOut, CL_FALSE, 0, 10 * 48 * sizeof(float), clResult, nullptr, &copyFromEvent);
    auto t2 = Clock::now();
    commandQueue.finish();
    auto t3 = Clock::now();
    cl_ulong copyToTime = copyToEvent.getProfilingInfo<CL_PROFILING_COMMAND_END>() -
                          copyToEvent.getProfilingInfo<CL_PROFILING_COMMAND_START>();
    cl_ulong kernelTime = kernelEvent.getProfilingInfo<CL_PROFILING_COMMAND_END>() -
                          kernelEvent.getProfilingInfo<CL_PROFILING_COMMAND_START>();
    cl_ulong copyFromTime = copyFromEvent.getProfilingInfo<CL_PROFILING_COMMAND_END>() -
                            copyFromEvent.getProfilingInfo<CL_PROFILING_COMMAND_START>();
    std::cout << "Enqueue: "
              << std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count() << "us, Total: "
              << std::chrono::duration_cast<std::chrono::microseconds>(t3 - t1).count() << "us, GPU: "
              << (copyToTime + kernelTime + copyFromTime) / 1000.0 << "us" << std::endl;
}
Output:
Enqueue: 1804us, Total: 4322us, GPU: 10.832us
Enqueue: 485us, Total: 668us, GPU: 10.666us
Enqueue: 237us, Total: 419us, GPU: 10.499us
Enqueue: 282us, Total: 474us, GPU: 10.832us
Enqueue: 345us, Total: 531us, GPU: 10.082us
Enqueue: 359us, Total: 555us, GPU: 10.915us
Enqueue: 345us, Total: 524us, GPU: 10.082us
Enqueue: 327us, Total: 504us, GPU: 10.416us
Enqueue: 363us, Total: 540us, GPU: 10.333us
Enqueue: 442us, Total: 595us, GPU: 10.916us
I found this related question: How to reduce OpenCL enqueue time/any other ideas?
But it had no useful answers for my situation.
Any help or ideas would be appreciated.
Thanks
BR Troels
I'm currently doing a project with CUDA where a pipeline is refreshed with 200-10000 new events every 1 ms. Each time, I want to call one (or two) kernels which compute a small list of outputs, then feed those outputs to the next element of the pipeline.
The theoretical flow is as follows (a rough sketch appears after the list):
receive data in an std::vector
cudaMemcpy the vector to GPU
processing
generate small list of outputs
cudaMemcpy to the output std::vector
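For reference, a rough host-side sketch of that flow (buffer names, sizes, and the commented-out kernel launch are mine, since the actual kernels are not shown):

#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

int main()
{
    const std::size_t max_events = 10000;            // upper bound per 1 ms batch, from above
    std::vector<float> h_in(max_events), h_out(max_events);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in,  max_events * sizeof(float));  // allocate once, reuse every tick
    cudaMalloc(&d_out, max_events * sizeof(float));

    for (int tick = 0; tick < 1000; ++tick)          // one iteration per 1 ms batch
    {
        const std::size_t n = 1000;                  // 200-10000 new events in practice

        // receive data in an std::vector (h_in), then copy it to the GPU
        cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

        // processing: one (or two) kernels would be launched here, e.g.
        // process_events<<<blocks, threads>>>(d_in, d_out, n);

        // copy the small list of outputs back to the output std::vector
        cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaDeviceSynchronize();                     // the call whose latency is the problem below
    }

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}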
But when I call cudaDeviceSynchronize on an empty 1-block/1-thread kernel with no processing, it already takes 0.7 to 1.4 ms on average, which is higher than my 1 ms timeframe.
I could eventually change the timeframe of the pipeline to receive events every 5 ms, with 5x more events each time, but that wouldn't be ideal.
What would be the best way to minimize the overhead of cudaDeviceSynchronize? Could streams be helpful in this situation? Or is there another solution to run the pipeline efficiently?
(Jetson TK1, compute capabilities 3.2)
Here's an nvprof log of the application:
==8285== NVPROF is profiling process 8285, command: python player.py test.rec
==8285== Profiling application: python player.py test.rec
==8285== Profiling result:
Time(%) Time Calls Avg Min Max Name
94.92% 47.697ms 5005 9.5290us 1.7500us 13.083us reset_timesurface(__int64, __int64*, __int64*, __int64*, __int64*, float*, float*, bool*, bool*, Event*)
5.08% 2.5538ms 8 319.23us 99.750us 413.42us [CUDA memset]
==8285== API calls:
Time(%) Time Calls Avg Min Max Name
75.00% 5.03966s 5005 1.0069ms 25.083us 11.143ms cudaDeviceSynchronize
17.44% 1.17181s 5005 234.13us 83.750us 3.1391ms cudaLaunch
4.71% 316.62ms 9 35.180ms 23.083us 314.99ms cudaMalloc
2.30% 154.31ms 50050 3.0830us 1.0000us 2.6866ms cudaSetupArgument
0.52% 34.857ms 5005 6.9640us 2.5000us 464.67us cudaConfigureCall
0.02% 1.2048ms 8 150.60us 71.917us 183.33us cudaMemset
0.01% 643.25us 83 7.7490us 1.3330us 287.42us cuDeviceGetAttribute
0.00% 12.916us 2 6.4580us 2.0000us 10.916us cuDeviceGetCount
0.00% 5.3330us 1 5.3330us 5.3330us 5.3330us cuDeviceTotalMem
0.00% 4.0830us 1 4.0830us 4.0830us 4.0830us cuDeviceGetName
0.00% 3.4160us 2 1.7080us 1.5830us 1.8330us cuDeviceGet
A small reconstruction of the program follows (nvprof log at the end). For some reason, the average for cudaDeviceSynchronize is 4 times lower here, but it's still really high for an empty 1-thread kernel:
/* Compile with `nvcc test.cu -I.`
* with -I pointing to "helper_cuda.h" and "helper_string.h" from CUDA samples
**/
#include <iostream>
#include <cuda.h>
#include <helper_cuda.h>
#define MAX_INPUT_BUFFER_SIZE 131072
typedef struct {
unsigned short x;
unsigned short y;
short a;
long long b;
} Event;
long long *d_a_[2], *d_b_[2];
float *d_as_, *d_bs_;
bool *d_some_bool_[2];
Event *d_data_;
int width_ = 320;
int height_ = 240;
__global__ void reset_timesurface(long long ts,
long long *d_a_0, long long *d_a_1,
long long *d_b_0, long long *d_b_1,
float *d_as, float *d_bs,
bool *d_some_bool_0, bool *d_some_bool_1, Event *d_data) {
// nothing here
}
void reset_errors(long long ts) {
static const int n = 1024;
static const dim3 grid_size(width_ * height_ / n
+ (width_ * height_ % n != 0), 1, 1);
static const dim3 block_dim(n, 1, 1);
reset_timesurface<<<1, 1>>>(ts, d_a_[0], d_a_[1],
d_b_[0], d_b_[1],
d_as_, d_bs_,
d_some_bool_[0], d_some_bool_[1], d_data_);
cudaDeviceSynchronize();
// static long long *h_holder = (long long*)malloc(sizeof(long long) * 2000);
// cudaMemcpy(h_holder, d_a_[0], 0, cudaMemcpyDeviceToHost);
}
int main(void) {
checkCudaErrors(cudaMalloc(&(d_a_[0]), sizeof(long long)*width_*height_*2));
checkCudaErrors(cudaMemset(d_a_[0], 0, sizeof(long long)*width_*height_*2));
checkCudaErrors(cudaMalloc(&(d_a_[1]), sizeof(long long)*width_*height_*2));
checkCudaErrors(cudaMemset(d_a_[1], 0, sizeof(long long)*width_*height_*2));
checkCudaErrors(cudaMalloc(&(d_b_[0]), sizeof(long long)*width_*height_*2));
checkCudaErrors(cudaMemset(d_b_[0], 0, sizeof(long long)*width_*height_*2));
checkCudaErrors(cudaMalloc(&(d_b_[1]), sizeof(long long)*width_*height_*2));
checkCudaErrors(cudaMemset(d_b_[1], 0, sizeof(long long)*width_*height_*2));
checkCudaErrors(cudaMalloc(&d_as_, sizeof(float)*width_*height_*2));
checkCudaErrors(cudaMemset(d_as_, 0, sizeof(float)*width_*height_*2));
checkCudaErrors(cudaMalloc(&d_bs_, sizeof(float)*width_*height_*2));
checkCudaErrors(cudaMemset(d_bs_, 0, sizeof(float)*width_*height_*2));
checkCudaErrors(cudaMalloc(&(d_some_bool_[0]), sizeof(bool)*width_*height_*2));
checkCudaErrors(cudaMemset(d_some_bool_[0], 0, sizeof(bool)*width_*height_*2));
checkCudaErrors(cudaMalloc(&(d_some_bool_[1]), sizeof(bool)*width_*height_*2));
checkCudaErrors(cudaMemset(d_some_bool_[1], 0, sizeof(bool)*width_*height_*2));
checkCudaErrors(cudaMalloc(&d_data_, sizeof(Event)*MAX_INPUT_BUFFER_SIZE));
for (int i = 0; i < 5005; ++i)
reset_errors(16487L);
cudaFree(d_a_[0]);
cudaFree(d_a_[1]);
cudaFree(d_b_[0]);
cudaFree(d_b_[1]);
cudaFree(d_as_);
cudaFree(d_bs_);
cudaFree(d_some_bool_[0]);
cudaFree(d_some_bool_[1]);
cudaFree(d_data_);
cudaDeviceReset();
}
/* nvprof ./a.out
==9258== NVPROF is profiling process 9258, command: ./a.out
==9258== Profiling application: ./a.out
==9258== Profiling result:
Time(%) Time Calls Avg Min Max Name
92.64% 48.161ms 5005 9.6220us 6.4160us 13.250us reset_timesurface(__int64, __int64*, __int64*, __int64*, __int64*, float*, float*, bool*, bool*, Event*)
7.36% 3.8239ms 8 477.99us 148.92us 620.17us [CUDA memset]
==9258== API calls:
Time(%) Time Calls Avg Min Max Name
53.12% 1.22036s 5005 243.83us 9.6670us 8.5762ms cudaDeviceSynchronize
25.10% 576.78ms 5005 115.24us 44.250us 11.888ms cudaLaunch
9.13% 209.77ms 9 23.308ms 16.667us 208.54ms cudaMalloc
6.56% 150.65ms 1 150.65ms 150.65ms 150.65ms cudaDeviceReset
5.33% 122.39ms 50050 2.4450us 833ns 6.1167ms cudaSetupArgument
0.60% 13.808ms 5005 2.7580us 1.0830us 104.25us cudaConfigureCall
0.10% 2.3845ms 9 264.94us 22.333us 537.75us cudaFree
0.04% 938.75us 8 117.34us 58.917us 169.08us cudaMemset
0.02% 461.33us 83 5.5580us 1.4160us 197.58us cuDeviceGetAttribute
0.00% 15.500us 2 7.7500us 3.6670us 11.833us cuDeviceGetCount
0.00% 7.6670us 1 7.6670us 7.6670us 7.6670us cuDeviceTotalMem
0.00% 4.8340us 1 4.8340us 4.8340us 4.8340us cuDeviceGetName
0.00% 3.6670us 2 1.8330us 1.6670us 2.0000us cuDeviceGet
*/
As detailed in the comments on the original message, my problem was entirely related to the GPU I'm using (Tegra K1). Here's an answer I found for this particular problem; it might be useful for other GPUs as well. The average for cudaDeviceSynchronize on my Jetson TK1 went from 250 us to 10 us.
The GPU clock rate of the Tegra was 72000 kHz by default; we have to set it to 852000 kHz using these commands:
$ echo 852000000 > /sys/kernel/debug/clock/override.gbus/rate
$ echo 1 > /sys/kernel/debug/clock/override.gbus/state
We can find the list of available frequencies using this command:
$ cat /sys/kernel/debug/clock/gbus/possible_rates
72000 108000 180000 252000 324000 396000 468000 540000 612000 648000 684000 708000 756000 804000 852000 (kHz)
More performance can be obtained (again, in exchange for a higher power draw) on both the CPU and GPU; check this link for more information.
A^2 + B^2 + C^2 + D^2 = N. Given an integer N, print out all possible combinations of integer values of A, B, C and D which solve the equation.
I am guessing we can do better than brute force.
Naive brute force would be something like:
n = 3200724;
lim = sqrt (n) + 1;
for (a = 0; a <= lim; a++)
    for (b = 0; b <= lim; b++)
        for (c = 0; c <= lim; c++)
            for (d = 0; d <= lim; d++)
                if (a * a + b * b + c * c + d * d == n)
                    printf ("%d %d %d %d\n", a, b, c, d);
Unfortunately, this will result in over a trillion loop iterations (lim is about 1790 here, so roughly 1790^4, about 10^13, iterations), which is not overly efficient.
You can actually do substantially better than that by discounting huge numbers of impossibilities at each level, with something like:
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int n = atoi (argv[1]);
    int a, b, c, d, na, nb, nc, nd;
    int count = 0;
    for (a = 0, na = n; a * a <= na; a++) {
        for (b = 0, nb = na - a * a; b * b <= nb; b++) {
            for (c = 0, nc = nb - b * b; c * c <= nc; c++) {
                for (d = 0, nd = nc - c * c; d * d <= nd; d++) {
                    if (d * d == nd) {
                        printf ("%d %d %d %d\n", a, b, c, d);
                        count++;
                    }
                }
            }
        }
    }
    printf ("Found %d solutions\n", count);
    return 0;
}
It's still brute force, but not quite as brutish inasmuch as it understands when to stop each level of looping as early as possible.
On my (relatively) modest box, that takes under a second (a) to get all solutions for numbers up to 50,000. Beyond that, it starts taking more time:
n time taken
---------- ----------
100,000 3.7s
1,000,000 6m, 18.7s
For n = ten million, it had been going about an hour and a half before I killed it.
So, I would say brute force is perfectly acceptable up to a point. Beyond that, more mathematical solutions would be needed.
For even more efficiency, you could only check those solutions where d >= c >= b >= a. That's because you could then build up all the solutions from those combinations into permutations (with potential duplicate removal where the values of two or more of a, b, c, or d are identical).
In addition, the body of the d loop doesn't need to check every value of d, just the last possible one.
Getting the results for 1,000,000 in that case takes under ten seconds rather than over six minutes:
0 0 0 1000
0 0 280 960
0 0 352 936
0 0 600 800
0 24 640 768
: : : :
424 512 512 544
428 460 500 596
432 440 480 624
436 476 532 548
444 468 468 604
448 464 520 560
452 452 476 604
452 484 484 572
500 500 500 500
Found 1302 solutions
real 0m9.517s
user 0m9.505s
sys 0m0.012s
That code follows:
#include <stdio.h>
int main(int argc, char *argv[]) {
int n = atoi (argv[1]);
int a, b, c, d, na, nb, nc, nd;
int count = 0;
for (a = 0, na = n; a * a <= na; a++) {
for (b = a, nb = na - a * a; b * b <= nb; b++) {
for (c = b, nc = nb - b * b; c * c <= nc; c++) {
for (d = c, nd = nc - c * c; d * d < nd; d++);
if (d * d == nd) {
printf ("%4d %4d %4d %4d\n", a, b, c, d);
count++;
}
}
}
}
printf ("Found %d solutions\n", count);
return 0;
}
And, as per a suggestion by DSM, the d loop can disappear altogether (since there's only one possible value of d (discounting negative numbers) and it can be calculated), which brings the one million case down to two seconds for me, and the ten million case to a far more manageable 68 seconds.
That version is as follows:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(int argc, char *argv[]) {
    int n = atoi (argv[1]);
    int a, b, c, d, na, nb, nc, nd;
    int count = 0;
    for (a = 0, na = n; a * a <= na; a++) {
        for (b = a, nb = na - a * a; b * b <= nb; b++) {
            for (c = b, nc = nb - b * b; c * c <= nc; c++) {
                nd = nc - c * c;
                d = sqrt (nd);
                if (d * d == nd) {
                    printf ("%d %d %d %d\n", a, b, c, d);
                    count++;
                }
            }
        }
    }
    printf ("Found %d solutions\n", count);
    return 0;
}
(a): All timings are done with the inner printf commented out so that I/O doesn't skew the figures.
The Wikipedia page has some interesting background information, but Lagrange's four-square theorem (or, more correctly, Bachet's Theorem - Lagrange only proved it) doesn't really go into detail on how to find said squares.
As I said in my comment, the solution is going to be nontrivial. This paper discusses the solvability of four-square sums. The paper alleges that:
There is no convenient algorithm (beyond the simple one mentioned in
the second paragraph of this paper) for finding additional solutions
that are indicated by the calculation of representations, but perhaps
this will streamline the search by giving an idea of what kinds of
solutions do and do not exist.
There are a few other interesting facts related to this topic. There
exist other theorems that state that every integer can be written as a
sum of four particular multiples of squares. For example, every
integer can be written as N = a^2 + 2b^2 + 4c^2 + 14d^2. There are 54
cases like this that are true for all integers, and Ramanujan provided
the complete list in the year 1917.
For more information, see Modular Forms. This is not easy to understand unless you have some background in number theory. If you could generalize Ramanujan's 54 forms, you may have an easier time with this. With that said, in the first paper I cite, there is a small snippet which discusses an algorithm that may find every solution (even though I find it a bit hard to follow):
For example, it was reported in 1911 that the calculator Gottfried
Ruckle was asked to reduce N = 15663 as a sum of four squares. He
produced a solution of 125^2 + 6^2 + 1^2 + 1^2 in 8 seconds, followed
immediately by 125^2 + 5^2 + 3^2 + 2^2. A more difficult problem
(reflected by a first term that is farther from the original number,
with correspondingly larger later terms) took 56 seconds: 11399 = 105^2
+ 15^2 + 8^2 + 5^2. In general, the strategy is to begin by setting the first term to be the largest square below N and try to represent the
smaller remainder as a sum of three squares. Then the first term is
set to the next largest square below N, and so forth. Over time a
lightning calculator would become familiar with expressing small
numbers as sums of squares, which would speed up the process.
(Emphasis mine.)
The algorithm is described as being recursive, but it could easily be implemented iteratively.
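To make that concrete, here is a rough recursive sketch of the quoted strategy: fix the largest square first, then try to represent the remainder with the remaining terms, backtracking when that fails. The function names are mine, and it finds one representation rather than all of them:

#include <cmath>
#include <cstdio>
#include <vector>

// floor(sqrt(x)) with a correction for floating-point rounding
long long isqrt(long long x) {
    long long r = static_cast<long long>(std::sqrt(static_cast<double>(x)));
    while (r * r > x) --r;
    while ((r + 1) * (r + 1) <= x) ++r;
    return r;
}

// Try to write `remaining` as a sum of `terms` squares; on success, the chosen
// roots are appended to `out` and true is returned.
bool represent(long long remaining, int terms, std::vector<long long>& out) {
    if (terms == 0) return remaining == 0;
    for (long long r = isqrt(remaining); r >= 0; --r) {   // largest candidate first
        out.push_back(r);
        if (represent(remaining - r * r, terms - 1, out)) return true;
        out.pop_back();                                    // backtrack
    }
    return false;
}

int main() {
    const long long n = 15663;                             // example from the quoted text
    std::vector<long long> roots;
    if (represent(n, 4, roots))
        std::printf("%lld = %lld^2 + %lld^2 + %lld^2 + %lld^2\n",
                    n, roots[0], roots[1], roots[2], roots[3]);
    return 0;
}

For N = 15663 this reproduces the 125^2 + 6^2 + 1^2 + 1^2 representation quoted above.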
It seems as though all integers can be made by such a combination:
0 = 0^2 + 0^2 + 0^2 + 0^2
1 = 1^2 + 0^2 + 0^2 + 0^2
2 = 1^2 + 1^2 + 0^2 + 0^2
3 = 1^2 + 1^2 + 1^2 + 0^2
4 = 2^2 + 0^2 + 0^2 + 0^2, 1^2 + 1^2 + 1^2 + 1^2
5 = 2^2 + 1^2 + 0^2 + 0^2
6 = 2^2 + 1^2 + 1^2 + 0^2
7 = 2^2 + 1^2 + 1^2 + 1^2
8 = 2^2 + 2^2 + 0^2 + 0^2
9 = 3^2 + 0^2 + 0^2 + 0^2, 2^2 + 2^2 + 1^2 + 0^2
10 = 3^2 + 1^2 + 0^2 + 0^2, 2^2 + 2^2 + 1^2 + 1^2
11 = 3^2 + 1^2 + 1^2 + 0^2
12 = 3^2 + 1^2 + 1^2 + 1^2, 2^2 + 2^2 + 2^2 + 0^2
.
.
.
and so forth
Doing some initial working in my head, I thought that only the perfect squares would have more than one possible solution. After listing them out, however, there seems to be no obvious pattern to them. Still, I thought of an algorithm I think is most appropriate for this situation:
The important thing is to use a 4-tuple (a, b, c, d). In any given 4-tuple which is a solution to a^2 + b^2 + c^2 + d^2 = n, we will set ourselves a constraint that a is always the largest of the 4, b is next, and so on and so forth like:
a >= b >= c >= d
Also note that a^2 cannot be less than n/4; otherwise, since a is the largest term, the sum of the squares would have to be less than n.
Then the algorithm is:
1a. Obtain floor(square_root(n)) # this is the maximum value of a - call it max_a
1b. Obtain the first value of a such that a^2 >= n/4 - call it min_a
2. For a in a range (min_a, max_a)
At this point we have selected a particular a, and are now looking at bridging the gap from a^2 to n - i.e. (n - a^2)
3. Repeat steps 1a through 2 to select a value of b. This time instead of finding
floor(square_root(n)) we find floor(square_root(n - a^2))
and so on and so forth. So the entire algorithm would look something like:
1a. Obtain floor(square_root(n)) # this is the maximum value of a - call it max_a
1b. Obtain the first value of a such that a^2 >= n/4 - call it min_a
2. For a in a range (min_a, max_a)
3a. Obtain floor(square_root(n - a^2))
3b. Obtain the first value of b such that b^2 >= (n - a^2)/3
4. For b in a range (min_b, max_b)
5a. Obtain floor(square_root(n - a^2 - b^2))
5b. Obtain the first value of c such that c^2 >= (n - a^2 - b^2)/2
6. For c in a range (min_c, max_c)
7. We now look at (n - a^2 - b^2 - c^2). If its square root is an integer, this is d.
Otherwise, this tuple will not form a solution
At steps 3b and 5b I use (n - a^2)/3, (n - a^2 - b^2)/2. We divide by 3 or 2, respectively, because of the number of values in the tuple not yet 'fixed'.
An example:
doing this on n = 12:
1a. max_a = 3
1b. min_a = 2
2. for a in range(2, 3):
use a = 2
3a. we now look at (12 - 2^2) = 8
max_b = 2
3b. min_b = 2
4. b must be 2
5a. we now look at (12 - 2^2 - 2^2) = 4
max_c = 2
5b. min_c = 2
6. c must be 2
7. (n - a^2 - b^2 - c^2) = 0, hence d = 0
so a possible tuple is (2, 2, 2, 0)
2. use a = 3
3a. we now look at (12 - 3^2) = 3
max_b = 1
3b. min_b = 1
4. b must be 1
5a. we now look at (12 - 3^2 - 1^2) = 2
max_c = 1
5b. min_c = 1
6. c must be 1
7. (n - a^2 - b^2 - c^2) = 1, hence d = 1
so a possible tuple is (3, 1, 1, 1)
These are the only two possible tuples - hey presto!
nebffa has a great answer. One suggestion:
step 3a: max_b = min(a, floor(square_root(n - a^2))) // since b <= a
max_c and max_d can be improved in the same way too.
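For what it's worth, here is a small C++ sketch of nebffa's algorithm with that min(a, ...) bound applied at every level (helper names are mine; tuples come out with a >= b >= c >= d):

#include <algorithm>
#include <cmath>
#include <cstdio>

// floor(sqrt(x)) with a correction for floating-point rounding
int isqrt(int x) {
    int r = static_cast<int>(std::sqrt(static_cast<double>(x)));
    while (r * r > x) --r;
    while ((r + 1) * (r + 1) <= x) ++r;
    return r;
}

// smallest v such that v*v*parts >= value  (the "1b", "3b", "5b" steps)
int min_root(int value, int parts) {
    int v = isqrt(value / parts);
    while (v * v * parts < value) ++v;
    return v;
}

int main() {
    const int n = 12;                                   // the worked example above
    for (int a = min_root(n, 4); a <= isqrt(n); ++a) {
        const int rem_a = n - a * a;
        const int max_b = std::min(a, isqrt(rem_a));
        for (int b = min_root(rem_a, 3); b <= max_b; ++b) {
            const int rem_b = rem_a - b * b;
            const int max_c = std::min(b, isqrt(rem_b));
            for (int c = min_root(rem_b, 2); c <= max_c; ++c) {
                const int rem_c = rem_b - c * c;
                const int d = isqrt(rem_c);             // d is fully determined here
                if (d * d == rem_c && d <= c)
                    std::printf("(%d, %d, %d, %d)\n", a, b, c, d);
            }
        }
    }
    return 0;
}

For n = 12 it prints the two tuples from the worked example above, (2, 2, 2, 0) and (3, 1, 1, 1).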
Here is another try:
1. Generate an array S: {0, 1, 2^2, 3^2, .... nr^2} where nr = floor(square_root(N)).
Now the problem is to find 4 numbers from the array such that sum(a, b, c, d) = N.
2. According to nebffa's post (steps 1a & 1b), a (which is the largest among all 4 numbers) lies in the range [nr/2 .. nr].
We can loop a from nr down to nr/2 and calculate r = N - S[a];
now the question is to find 3 numbers from S such that sum(b, c, d) = r = N - S[a].
Here is the code:
nr = square_root(N);
S = {0, 1, 2^2, 3^2, 4^2, .... nr^2};

for (a = nr; a >= nr/2; a--)
{
    r = N - S[a];
    // it is now a 3SUM problem
    for (b = a; b >= 0; b--)
    {
        r1 = r - S[b];
        if (r1 < 0)
            continue;
        if (r1 > N/2)   // because (a^2 + b^2) >= (c^2 + d^2)
            break;
        for (c = 0, d = b; c <= d; )
        {
            sum = S[c] + S[d];
            if (sum == r1)
            {
                print a, b, c, d;
                c++; d--;
            }
            else if (sum < r1)
                c++;
            else
                d--;
        }
    }
}
The runtime is O(square_root(N)^3).
Here are the test results running Java on my VM (time in milliseconds; result# is the total number of valid combinations; time1 is with printout, time2 without):
N result# time1 time2
----------- -------- -------- -----------
1,000,000 1302 859 281
10,000,000 6262 16109 7938
100,000,000 30912 442469 344359
I thought that the Cont monad is just equivalent to the CPS transformation, so if I have a monadic sum and run it in the Identity monad, it will fail with a stack overflow, while if I run it in the Cont monad, it will be okay thanks to tail calls.
So I've written a simple program to verify my idea. But to my surprise, the results don't match my expectations, no doubt due to my limited knowledge.
All programs are compiled and run using ghc --make Test.hs -o test && ./test
{-# LANGUAGE BangPatterns #-}
import Control.Monad.Cont
import Control.Monad.Identity
import Data.List (foldl')

sum0 n = if n==0 then 0 else n + sum0 (n-1)
sum1 n = if n==0 then return 0 else sum1 (n-1) >>= \ v -> seq v (return (n+v))
sum2 n k = if n == 0 then k 0 else sum2 n (\v -> k (n + v))
sum3 n k = if n == 0 then k 0 else sum3 n (\ !v -> k (n + v))
sum4 n k = if n == 0 then k 0 else sum4 n (\ v -> seq v ( k (n + v)))
sum5 n = if n==0 then return 0 else sum5 (n-1) >>= \ v -> (return (n+v))
main = print (sum0 3000000)
Stack overflow. This is reasonable.
main = print (flip runCont id (sum1 3000000))
Uses 180M of memory, which is reasonable, but I am not clear on why seq is needed here, since its continuation is not applied until n goes to 0.
main = print (flip runCont id (sum5 3000000))
Stack overflow. Why?
main = print (flip runCont (const 0) (sum1 3000000))
Uses 130M memory. This is reasonable.
main = print (flip runCont (const 0) (sum5 3000000))
Uses 118M memory. This is reasonable.
main = print (sum2 3000000 (const 0))
Uses a lot of memory (more than 1G). I thought sum2 is equivalent to sum5 (when sum5 is in Cont monad). Why?
main = print (sum3 3000000 (const 0))
Uses a lot of memory. I thought sum3 is equivalent to sum1 (Cont monad). Why?
main = print (runIdentity (sum1 3000000))
Stack overflow, exactly what I want.
main = print (sum3 3000000 id)
Uses a lot of memory. Equivalent to sum1, why?
main = print (sum4 3000000 id)
Uses a lot of memory. Equivalent to sum1, why?
main = print (sum [1 .. 3000000])
Stack overflow. The definition of sum is foldl (+) 0, so this is reasonable.
main = print (foldl' (+) 0 [1 .. 3000000])
Uses 1.5M.
First of all, it looks to me like sum2, sum3, and sum4 never actually decrement n. So they're using lots of memory because they're going into an infinite loop that does allocation.
After correcting that, I've run each of your tests again with the following results, where "allocation" refers to approximate peak memory use:
main = print (sum0 3000000) : Stack overflow, after allocating very little memory
main = print (flip runCont id (sum1 3000000)) : Success, allocating similar amounts to what you saw
main = print (flip runCont id (sum5 3000000)) : Stack overflow, after allocating similar amounts of memory as sum1.
main = print (flip runCont (const 0) (sum1 3000000)) : Success, similar allocation as the above
main = print (flip runCont (const 0) (sum5 3000000)) : Same
main = print (sum2 3000000 (const 0)) : Success, about 70% as much allocation as sum1
main = print (sum3 3000000 (const 0)) : Success, about 50% as much allocation as sum1
main = print (runIdentity (sum1 3000000)) : Stack overflow, with little allocation
main = print (sum3 3000000 id) : Success, about 50% as much allocation as sum1
main = print (sum4 3000000 id) : Success, about 50% as much allocation as sum1
main = print (sum [1 .. 3000000]) : Stack overflow, with about 80% as much allocation as sum1
main = print (foldl' (+) 0 [1 .. 3000000]) : Success, with almost no allocation
So that's mostly what you expected, with the exception of why seq makes such a difference between sum1 vs. sum5.