Random memory write is slower than random memory read?

Random memory write is slower than random memory read? - performance

I'm trying to figure out memory access time of sequential/random memory read/write. Here's the code:
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>
#define PRINT_EXCECUTION_TIME(msg, code) \
do { \
struct timeval t1, t2; \
double elapsed; \
gettimeofday(&t1, NULL); \
do { \
code; \
} while (0); \
gettimeofday(&t2, NULL); \
elapsed = (t2.tv_sec - t1.tv_sec) * 1000.0; \
elapsed += (t2.tv_usec - t1.tv_usec) / 1000.0; \
printf(msg " time: %f ms\n", elapsed); \
} while (0);
const int RUNS = 20;
const int N = (1 << 27) - 1;
int *data;
int seqR() {
register int res = 0;
register int *data_p = data;
register int pos = 0;
for (register int j = 0; j < RUNS; j++) {
for (register int i = 0; i < N; i++) {
pos = (pos + 1) & N;
res = data_p[pos];
}
}
return res;
}
int seqW() {
register int res = 0;
register int *data_p = data;
register int pos = 0;
for (register int j = 0; j < RUNS; j++) {
for (register int i = 0; i < N; i++) {
pos = (pos + 1) & N;
data_p[pos] = res;
}
}
return res;
}
int rndR() {
register int res = 0;
register int *data_p = data;
register int pos = 0;
for (register int j = 0; j < RUNS; j++) {
for (register int i = 0; i < N; i++) {
pos = (pos + i) & N;
res = data_p[pos];
}
}
return res;
}
int rndW() {
register int res = 0;
register int *data_p = data;
register int pos = 0;
for (register int j = 0; j < RUNS; j++) {
for (register int i = 0; i < N; i++) {
pos = (pos + i) & N;
data_p[pos] = res;
}
}
return res;
}
int main() {
data = (int *)malloc(sizeof(int) * (N + 1));
assert(data);
for (int i = 0; i < N; i++) {
data[i] = i;
}
for (int i = 0; i < 10; i++) {
PRINT_EXCECUTION_TIME("seqR", seqR());
PRINT_EXCECUTION_TIME("seqW", seqW());
PRINT_EXCECUTION_TIME("rndR", rndR());
PRINT_EXCECUTION_TIME("rndW", rndW());
}
return 0;
}
I used gcc 6.5.0 with -O0 to prevent optimization but got result like this:
seqR time: 2538.010000 ms
seqW time: 2394.991000 ms
rndR time: 40625.169000 ms
rndW time: 46184.652000 ms
seqR time: 2411.038000 ms
seqW time: 2309.115000 ms
rndR time: 41575.063000 ms
rndW time: 46206.275000 ms
It's easy to understand that sequential access is way faster than random access. However, it doesn't make sense to me that random write is slower than random read while sequential write is faster than sequential read. What reason could cause this?
In addition, am I safe to say memory bandwidth for seqR is (20 * ((1 << 27) - 1) * 4 * 1024 * 1024 * 1024)GB / (2.538)s = 4.12GB/s?

Sounds normal. All x86-64 CPUs (and most other modern CPUs) use write-back / write-allocate caches so a write costs a read before it can commit to cache, and an eventual write-back.
with -O0 to prevent optimization
Since you used register on all your locals, this is one of the rare times when this didn't make your benchmark meaningless.
You could have just used volatile on your arrays, though, to make sure every one of those accesses happened in order, but leave it up to the optimizer how to make that happen.
Am I safe to say memory bandwidth for seqR is (20 * ((1 << 27) - 1) * 4 * 1024 * 1024 * 1024)GB / (2.538)s = 4.12GB/s?
No, you have an extra factor of 2^30 and 10^9 in your numerator. But you did it wrong and got close to the right number anyway.
The correct calculation is RUNS * N * sizeof(int) / time bytes per second, or that divided by 10^9 GB/s. Or divided by 2^30 for base 2 GiB/s. Memory sizes are usually in GiB, but you can take your pick with bandwidth; DRAM clock speeds are normally things like 1600 MHz, so base-10 GB = 10^9 is certainly normal for theoretical max bandwidths in GB/s.)
So 4.23 GB/s in base-10 GB.
Yes, you initialized the array first so neither timed run is triggering page-faults, but I might still have used the 2nd run after the CPU has warmed up to max turbo, if it hadn't already.
But keep in mind this is un-optimized code. That's how fast your un-optimized code ran, and doesn't tell you much about how fast your memory is. It's probably CPU bound, not memory.
Especially with a redundant & N in there to match the CPU work of the rndR/W functions. HW prefetching is probably able to keep up with 4GB/s, but it's still not even reading 1 int per clock cycle.

Related

Error correction on small message (8-Bit) with high resilience, what is the best method?

I need to implement an ECC algorithm on an 8-bit message with 32 bits to work with (32, 8), being new to ECC I started to google and learn a bit about it and ended up coming across two ECC methods, Hamming codes and Reed Solomon. Given that I needed my message to be resilient to 4-8 random bit flips on average I disregarded Hammings and looked into Reed, however, after applying it to my problem I realized it is also not suitable for my use case because while a whole symbol (8 bits) could be flipped, because my errors tend to spread out (on average), it can usually only fix a single error...
Therefore in the end I just settled for my first instinct which is to just copy the data over like so:
00111010 --> 0000 0000 1111 1111 1111 0000 1111 0000
This way every bit is resilient up to 1 error (8 across all bits) by taking the most prominent bits on each actual bit from the encoded message, and every bit can be subject to two bitflips while still detecting there was an error (which is also usable for my use case, eg: input 45: return [45, 173] is still useful).
My question then is if there is any better method, while I am pretty sure there is, I am not sure where to go from here.
By "better method" I mean resilient to even more errors given the (32, 8) ratio.

You can get a distance-11 code pretty easily using randomization.
#include <stdio.h>
#include <stdlib.h>
int main() {
uint32_t codes[256];
for (int i = 0; i < 256; i++) {
printf("%d\n", i);
retry:
codes[i] = arc4random();
for (int j = 0; j < i; j++) {
if (__builtin_popcount(codes[i] ^ codes[j]) < 11) goto retry;
}
}
}

I made a test program for David Eisenstat's example, to show it works for 1 to 5 bits in error. Code is for Visual Studio.
#include <intrin.h>
#include <stdio.h>
#include <stdlib.h>
typedef unsigned int uint32_t;
/*----------------------------------------------------------------------*/
/* InitCombination - init combination */
/*----------------------------------------------------------------------*/
void InitCombination(int a[], int k, int n) {
for(int i = 0; i < k; i++)
a[i] = i;
--a[k-1];
}
/*----------------------------------------------------------------------*/
/* NextCombination - generate next combination */
/*----------------------------------------------------------------------*/
int NextCombination(int a[], int k, int n) {
int pivot = k - 1;
while (pivot >= 0 && a[pivot] == n - k + pivot)
--pivot;
if (pivot == -1)
return 0;
++a[pivot];
for (int i = pivot + 1; i < k; ++i)
a[i] = a[pivot] + i - pivot;
return 1;
}
/*----------------------------------------------------------------------*/
/* Rnd32 - return pseudo random 32 bit number */
/*----------------------------------------------------------------------*/
uint32_t Rnd32()
{
static uint32_t r = 0;
r = r*1664525+1013904223;
return r;
}
static uint32_t codes[256];
/*----------------------------------------------------------------------*/
/* main - test random hamming distance 11 code */
/*----------------------------------------------------------------------*/
int main() {
int ptn[5]; /* error bit indexes */
int i, j, n;
uint32_t m;
int o, p;
for (i = 0; i < 256; i++) { /* generate table */
retry:
codes[i] = Rnd32();
for (j = 0; j < i; j++) {
if (__popcnt(codes[i] ^ codes[j]) < 11) goto retry;
}
}
for(n = 1; n <= 5; n++){ /* test 1 to 5 bit error patterns */
InitCombination(ptn, n, 32);
while(NextCombination(ptn, n, 32)){
for(i = 0; i < 256; i++){
o = m = codes[i]; /* o = m = coded msg */
for(j = 0; j < n; j++){ /* add errors to m */
m ^= 1<<ptn[j];
}
for(j = 0; j < 256; j++){ /* search for code */
if((p =__popcnt(m ^ codes[j])) <= 5)
break;
}
if(i != j){ /* check for match */
printf("fail %u %u\n", i, j);
goto exit0;
}
}
}
}
exit0:
return 0;
}

My code works for 200,000 prime numbers but shows segmentation error(core dumped) when i try to run it for 2,000,000 numbers

'My code works for 200,000 prime numbers but shows segmentation error(core dumped) when i try to run it for 2,000,000 numbers'
using namespace std;
int main(){
long long n;
cin>>n;
long long prime[n];
for(long long i=0;i<=n;i++){
prime[i]=1;
}
prime[0]=0;
prime[1]=0;
for(long long i=2;i<=sqrt(n);i++){
for(long long j=2;i*j<=n;j++){
prime[i*j]=0;
}
}
unsigned long long res=0;
for (long long i = 2; i <= n; ++i){
if(prime[i]==1){
res+=i;
}
}
cout<<res;
}

long long prime[n]; is a variable-length array and not supported by C++. Some compilers support it with a compiler extension. The memory is allocated on the stack and it's larger than the stack size. Use a std::vector to allocate the memory on the heap.
Don't access prime[n]. The last element is prime[n-1].
Use i * i <= n instead of i <= sqrt(n).
I recommend to use std::int64_t instead of long long. long long has at least 64 bits. std::int64_t has exactly 64 bits.
#include <iostream>
#include <vector>
int main(){
std::int64_t n;
std::cin >> n;
std::vector<std::int64_t> prime(n);
for(std::int64_t i = 0; i < n; ++i){
prime[i] = 1;
}
prime[0] = 0;
prime[1] = 0;
for(std::int64_t i = 2; i * i <= n; ++i){
for(std::int64_t j = 2; i * j < n; ++j){
prime[i * j] = 0;
}
}
std::uint64_t res = 0;
for (std::int64_t i = 2; i < n; ++i){
if(prime[i] == 1){
res += i;
}
}
std::cout << res;
}

Why inner product of same size matrix in Eigen cost quite different time?

I used Eigen to calculate inner product of two matrix, the first one is A=(BC).eval() and second one is D=(EF).eval(). Here B,C,E,F are the same size (1500 * 1500) but with different values. I find the first one cost about 200 ms while the second one cost about 6000 ms, I have no idea why this happened.
#include <iostream>
#include <time.h>
#include "Eigen/Dense"
int main() {
clock_t start, stop;
Eigen::MatrixXf mat_a(1200, 1500);
Eigen::MatrixXf mat_b(1500, 1500);
Eigen::MatrixXf mat_r(1000, 1300);
int i, j;
float c = 0;
for (i = 0; i < 1200; i++) {
for (j = 0; j < 1500; j++) {
mat_a(i, j) = (float)(c/3 * 1.0e-40);
//if (i % 2 == 0 && j % 2 == 0) mat_a(i, j);
c++;
}
}
//std::cout << mat_a.row(0) << std::endl;
c = 100;
for (i = 0; i < 1500; i++) {
for (j = 0; j < 1500; j++) {
mat_b(i, j) = (float)(c/3 * 0.5e-10);
c++;
}
}
//std::cout << mat_b.row(0) << std::endl;
start = clock();
mat_r = mat_a * mat_b;
stop = clock();
std::cout << stop - start << std::endl;
getchar();
return 0;
}
as show in above example code. I find this is caused by the value of the matrix, when mat_a has value about e-40 and mat_b has value about e-10, this problem occurs stably.
Is there anyone who can explain it?

This is because your matrix contains denormal numbers that are slow to deal with for the CPU. You should make sure that you are using reasonable units so that those can be considered as zeros, and then enable the flush-to-zero (FTZ) and denormals-as-zero flags (DAZ), for instance using the fast-math mode of your compiler or at runtime, see this SO question.

How Should I Implement Cilk Parallelism with a Recursive Scan Algorithm?

I implemented a recursive scan (prefix sum) algorithm, which I've included below. Here, I simply generate random lists of size powers of two up to the twenty-seventh power, checking against a simple sequential scan for accuracy. It works.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <math.h>
#include <mkl.h>
int *pscan(int *x, int n, int z, int chunk_size);
int reduce(int *x, int n);
int main(int argc, char **argv)
{
int n;
int i, j, k;
int *x, *seq, *r;
double begin, end;
srand48(time(0));
/* Randomly generate array of size n. */
for (k = 2; k < 28; k++) {
n = (int) pow(2, k);
seq = (int *) malloc(sizeof(int) * n);
x = (int *) malloc(sizeof(int) * n);
for (i = 0; i < n; i++) {
x[i] = lrand48() % 100 - 50;
seq[i] = x[i];
}
/* Parallel scan. */
begin = dsecnd();
r = pscan(x, n, 0, 2);
end = dsecnd();
printf("%d %lf\n", n, end - begin);
/* Sequential check. */
for (i = 1; i < n; i++) {
seq[i] = seq[i - 1] + seq[i];
}
for (i = 0; i < n; i++) {
if (r[i] != seq[i]) {
fprintf(stderr, "AGGGHHH!!! ERROR. Found with vector: \n");
for (j = 0; j < n; j++) {
printf("%d ", x[i]);
}
printf("\n");
exit(1);
}
}
free(r);
free(x);
free(seq);
}
return 0;
}
/* Perform parallel scan. */
int *pscan(int *x, int n, int z, int chunk_size)
{
int i, j;
int *sums, *sumscan, *scan, **fsum, *rv;
/* Base case, serially scan a chunk. */
if (n <= chunk_size) {
scan = (int *) malloc(sizeof(int) * n);
scan[0] = x[0] + z;
for (i = 1; i < n; i++) {
scan[i] = x[i] + scan[i - 1];
}
return scan;
}
sums = (int *) malloc(sizeof(int) * (n / chunk_size));
/* Reduce each chunk of the array. */
for (i = 0; i < n / chunk_size; i++) {
sums[i] = reduce(&x[i * chunk_size], chunk_size);
}
/* Perform a scan on the sums. */
sumscan = pscan(sums, n / chunk_size, 0, chunk_size);
free(sums);
fsum = (int **) malloc(sizeof(int *) * (n / chunk_size));
/* Perform a recursive scan on each chunk, using
the appropriate offset from the sums scan. */
for (i = 0; i < n / chunk_size; i++) {
if (i > 0) {
fsum[i] = pscan(&x[i * chunk_size], chunk_size, sumscan[i - 1], chunk_size);
} else {
fsum[i] = pscan(&x[i * chunk_size], chunk_size, 0, chunk_size);
}
}
free(sumscan);
rv = (int *) malloc(sizeof(int) * n);
/* Join the arrays. */
for (i = 0; i < n / chunk_size; i++) {
for (j = 0; j < chunk_size; j++) {
rv[i * chunk_size + j] = fsum[i][j];
}
}
for (i = 0; i < n / chunk_size; i++) {
free(fsum[i]);
}
free(fsum);
return rv;
}
/* Serial reduction. */
int reduce(int *x, int n)
{
int i;
int sum;
sum = 0;
for (i = 0; i < n; i++) {
sum += x[i];
}
return sum;
}
Now, I'd like to parallelize it. Because I'm feeling a little hipster-ish, I've hacked up a Cilk implementation. I just replace the two main for loops to parallelize 1) the reduction and 2) the recursive scan of each chunk, using the appropriate scan of the chunk reductions as an offset. It looks like so.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <math.h>
#include <cilk/cilk.h>
#include <mkl.h>
int *pscan(int *x, int n, int z, int chunk_size);
int reduce(int *x, int n);
int main(int argc, char **argv)
{
int n;
int i, j, k;
int *x, *seq, *r;
double begin, end;
srand48(time(0));
/* Randomly generate array of size n. */
for (k = 2; k < 28; k++) {
n = (int) pow(2, k);
seq = (int *) malloc(sizeof(int) * n);
x = (int *) malloc(sizeof(int) * n);
for (i = 0; i < n; i++) {
x[i] = lrand48() % 100 - 50;
seq[i] = x[i];
}
/* Parallel scan. */
begin = dsecnd();
r = pscan(x, n, 0, 2);
end = dsecnd();
printf("%d %lf\n", n, end - begin);
/* Sequential check. */
for (i = 1; i < n; i++) {
seq[i] = seq[i - 1] + seq[i];
}
for (i = 0; i < n; i++) {
if (r[i] != seq[i]) {
fprintf(stderr, "AGGGHHH!!! ERROR. Found with vector: \n");
for (j = 0; j < n; j++) {
printf("%d ", x[i]);
}
printf("\n");
exit(1);
}
}
free(r);
free(x);
free(seq);
}
return 0;
}
/* Perform parallel scan. */
int *pscan(int *x, int n, int z, int chunk_size)
{
int i, j;
int *sums, *sumscan, *scan, **fsum, *rv;
/* Base case, serially scan a chunk. */
if (n <= chunk_size) {
scan = (int *) malloc(sizeof(int) * n);
scan[0] = x[0] + z;
for (i = 1; i < n; i++) {
scan[i] = x[i] + scan[i - 1];
}
return scan;
}
sums = (int *) malloc(sizeof(int) * (n / chunk_size));
/* Reduce each chunk of the array. */
cilk_for (i = 0; i < n / chunk_size; i++) {
sums[i] = reduce(&x[i * chunk_size], chunk_size);
}
/* Perform a scan on the sums. */
sumscan = pscan(sums, n / chunk_size, 0, chunk_size);
free(sums);
fsum = (int **) malloc(sizeof(int *) * (n / chunk_size));
/* Perform a recursive scan on each chunk, using
the appropriate offset from the sums scan. */
cilk_for (i = 0; i < n / chunk_size; i++) {
if (i > 0) {
fsum[i] = pscan(&x[i * chunk_size], chunk_size, sumscan[i - 1], chunk_size);
} else {
fsum[i] = pscan(&x[i * chunk_size], chunk_size, 0, chunk_size);
}
}
free(sumscan);
rv = (int *) malloc(sizeof(int) * n);
/* Join the arrays. */
for (i = 0; i < n / chunk_size; i++) {
for (j = 0; j < chunk_size; j++) {
rv[i * chunk_size + j] = fsum[i][j];
}
}
for (i = 0; i < n / chunk_size; i++) {
free(fsum[i]);
}
free(fsum);
return rv;
}
/* Serial reduction. */
int reduce(int *x, int n)
{
int i;
int sum;
sum = 0;
for (i = 0; i < n; i++) {
sum += x[i];
}
return sum;
}
And it works! Well, it returns correct results. It doesn't achieve the performance I had hoped. The original performance was
4 0.000004
8 0.000001
16 0.000002
32 0.000003
64 0.000005
128 0.000011
256 0.000019
512 0.000035
1024 0.000068
2048 0.000130
4096 0.000257
8192 0.000512
16384 0.001129
32768 0.002262
65536 0.004519
131072 0.009065
262144 0.018297
524288 0.037416
1048576 0.078307
2097152 0.157448
4194304 0.313855
8388608 0.625689
16777216 1.251949
33554432 2.589439
67108864 5.084731
134217728 10.402186
for the single-threaded application, but the Cilk version performend worse, with the following runtimes
4 0.005383
8 0.000011
16 0.000009
32 0.000111
64 0.000055
128 0.000579
256 0.000339
512 0.000544
1024 0.000701
2048 0.001086
4096 0.001265
8192 0.001742
16384 0.002283
32768 0.003891
65536 0.005398
131072 0.009255
262144 0.020736
524288 0.058156
1048576 0.103893
2097152 0.215460
4194304 0.419988
8388608 0.749368
16777216 1.650938
33554432 2.960451
67108864 5.799836
134217728 11.294398
I have a 24-core machine, so we're obviously not seeing the speed-up we would hope for here. My first thought was that Cilk is mishandling the recursion, causing oversubscription, but Cilk is specifically supposed to handle recursion well. Any tips on how to implement this properly? I tried adding cilk_for to the bottom for loop (freeing everything) and the inner for-loop of the penultimate set of loops (joining the array), but that slowed performance down even more.
Any advice is well-appreciated.
However, please don't tell me to switch to Belloch's parallel scan algorithm discussed here. I already implemented that in Cilk, and it worked quite well. I'd like to see if I can match its performance with this recursive solution.

I fixed my performance problems by finding the optimal chunk size for each problem. At that chunk size, the (same) parallel version performs better than the sequential version.
In summary, there were a few things wrong with both my general approach and particularly the chunk size of two:
My benchmarking approach. In a code with a tuning parameter, it doesn't make much sense to plot runtime vs. problem size using the same value for the tuning parameter because the optimal value is dependent on problem size.
A chunk size of two was likely problematic because, while it maximizes parallelism, it also maximizes the number of levels of recursion and, likewise, the overhead that comes along with it.
A chunk size of two prevents vectorization.
As Leeor suggested, a chunk size of two probably also leads to false sharing in the cache.
Props to Leeor for leading me in the right direction.

gcc openmp thread reuse

I am using gcc's implementation of openmp to try to parallelize a program. Basically the assignment is to add omp pragmas to obtain speedup on a program that finds amicable numbers.
The original serial program was given(shown below except for the 3 lines I added with comments at the end). We have to parallize first just the outer loop, then just the inner loop. The outer loop was easy and I get close to ideal speedup for a given number of processors. For the inner loop, I get much worse performance than the original serial program. Basically what I am trying to do is a reduction on the sum variable.
Looking at the cpu usage, I am only using ~30% per core. What could be causing this? Is the program continually making new threads everytime it hits the omp parallel for clause? Is there just so much more overhead in doing a barrier for the reduction? Or could it be memory access issue(eg cache thrashing)? From what I read with most implementations of openmp threads get reused overtime(eg pooled), so I am not so sure the first problem is what is wrong.
#include<stdio.h>
#include<stdlib.h>
#include<math.h>
#include <omp.h>
#define numThread 2
int main(int argc, char* argv[]) {
int ser[29], end, i, j, a, limit, als;
als = atoi(argv[1]);
limit = atoi(argv[2]);
for (i = 2; i < limit; i++) {
ser[0] = i;
for (a = 1; a <= als; a++) {
ser[a] = 1;
int prev = ser[a-1];
if ((prev > i) || (a == 1)) {
end = sqrt(prev);
int sum = 0;//added this
#pragma omp parallel for reduction(+:sum) num_threads(numThread)//added this
for (j = 2; j <= end; j++) {
if (prev % j == 0) {
sum += j;
sum += prev / j;
}
}
ser[a] = sum + 1;//added this
}
}
if (ser[als] == i) {
printf("%d", i);
for (j = 1; j < als; j++) {
printf(", %d", ser[j]);
}
printf("\n");
}
}
}

OpenMP thread teams are instantiated on entering the parallel section. This means, indeed, that the thread creation is repeated every time the inner loop is starting.
To enable reuse of threads, use a larger parallel section (to control the lifetime of the team) and specificly control the parallellism for the outer/inner loops, like so:
Execution time for test.exe 1 1000000 has gone down from 43s to 22s using this fix (and the number of threads reflects the numThreads defined value + 1
PS Perhaps stating the obvious, it would not appear that parallelizing the inner loop is a sound performance measure. But that is likely the whole point to this exercise, and I won't critique the question for that.
#include<stdio.h>
#include<stdlib.h>
#include<math.h>
#include <omp.h>
#define numThread 2
int main(int argc, char* argv[]) {
int ser[29], end, i, j, a, limit, als;
als = atoi(argv[1]);
limit = atoi(argv[2]);
#pragma omp parallel num_threads(numThread)
{
#pragma omp single
for (i = 2; i < limit; i++) {
ser[0] = i;
for (a = 1; a <= als; a++) {
ser[a] = 1;
int prev = ser[a-1];
if ((prev > i) || (a == 1)) {
end = sqrt(prev);
int sum = 0;//added this
#pragma omp parallel for reduction(+:sum) //added this
for (j = 2; j <= end; j++) {
if (prev % j == 0) {
sum += j;
sum += prev / j;
}
}
ser[a] = sum + 1;//added this
}
}
if (ser[als] == i) {
printf("%d", i);
for (j = 1; j < als; j++) {
printf(", %d", ser[j]);
}
printf("\n");
}
}
}
}

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio