C++ vector appears to be faster than C array (timing). Why? (Xcode)

Hello, I have been comparing C++ vector vs. C array timings.
On my Mac the vector takes time to be constructed, but after that, the comparison makes the vector the winner.
How does that work?
I was told that int[] is faster than a dynamic vector.
#include <iostream>
#include <vector>
#include <cstdio>   // printf
#include <ctime>    // clock, CLOCKS_PER_SEC
using namespace std;

#define N (100000000)
//int sd[N];

int main() {
    clock_t start;
    double temps;
    static int sd[N];

    start = clock();
    for (unsigned long i = 0; i < N; i++) {
        if (sd[i] == 3)
            ;
    }
    temps = (clock() - start) / (double)(CLOCKS_PER_SEC / 1000);
    printf("Time: %f ms\n", temps);

    vector<int> vd(N);
    start = clock();
    for (unsigned long i = 0; i < N; i++) {
        if (vd[i] == 3)
            ;
    }
    temps = (clock() - start) / (double)(CLOCKS_PER_SEC / 1000);
    printf("Time: %f ms\n", temps);

    while (1)  // keep the process alive so memory can be inspected in Xcode
        ;
    return 0;
}
I get these results:
Time: 422.87400 ms
Time: 300.84700 ms
Even when the vector test runs first, the vector appears to be faster than the C array.
Thank you for your explanation.
Another question: in Xcode, why do I see memory in use as soon as the vector is declared, whereas for the static C array the memory only shows up once I have gone through all the cells, as in the code (for ... if(sd[i] ...))?

I have noticed that if I initialize all the C array cells to 0 (or to 6, for example...), the C array becomes faster than or equal to the vector.
// Same includes and definition of N as above.
int main() {
    clock_t start;
    double temps;
    static int sd[N];

    for (unsigned long i = 0; i < N; i++) {
        sd[i] = 0;
    }

    start = clock();
    //puts("initialized");
    for (unsigned long i = 0; i < N; i++) {
        if (sd[i] == 3)
            ;
    }
    temps = (clock() - start) / (double)(CLOCKS_PER_SEC / 1000);
    printf("Time: %f ms\n", temps);

    //puts("initialized");
    vector<int> vd(N);
    start = clock();
    for (unsigned long i = 0; i < N; i++) {
        if (vd[i] == 3)
            ;
    }
    temps = (clock() - start) / (double)(CLOCKS_PER_SEC / 1000);
    printf("Time: %f ms\n", temps);

    while (1)  // keep the process alive so memory can be inspected in Xcode
        ;
    return 0;
}
With all the C array cells initialized to 0, I can now see the memory used in Xcode.
So another question: why is it faster when you initialize, in this case (or in general)?

Related

Speed up random memory access using prefetch

I am trying to speed up a single program by using prefetches. The purpose of my program is just for testing. Here is what it does:
It uses two int buffers of the same size
It reads one-by-one all the values of the first buffer
It reads the value at the index in the second buffer
It sums all the values taken from the second buffer
It does all the previous steps for bigger and bigger
At the end, I print the number of voluntary and involuntary CPU
Initially, the values in the first buffer are the values of their indices (cf. function createIndexBuffer in the code just below).
It will be clearer in the code of my program:
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <sys/time.h>

#define BUFFER_SIZE ((unsigned long) 4096 * 100000)

unsigned int randomUint()
{
    int value = rand() % UINT_MAX;
    return value;
}

unsigned int * createValueBuffer()
{
    unsigned int * valueBuffer = (unsigned int *) malloc(BUFFER_SIZE * sizeof(unsigned int));
    for (unsigned long i = 0 ; i < BUFFER_SIZE ; i++)
    {
        valueBuffer[i] = randomUint();
    }
    return (valueBuffer);
}

unsigned int * createIndexBuffer()
{
    unsigned int * indexBuffer = (unsigned int *) malloc(BUFFER_SIZE * sizeof(unsigned int));
    for (unsigned long i = 0 ; i < BUFFER_SIZE ; i++)
    {
        indexBuffer[i] = i;
    }
    return (indexBuffer);
}

unsigned long long computeSum(unsigned int * indexBuffer, unsigned int * valueBuffer)
{
    unsigned long long sum = 0;
    for (unsigned int i = 0 ; i < BUFFER_SIZE ; i++)
    {
        unsigned int index = indexBuffer[i];
        sum += valueBuffer[index];
    }
    return (sum);
}

unsigned int computeTimeInMicroSeconds()
{
    unsigned int * valueBuffer = createValueBuffer();
    unsigned int * indexBuffer = createIndexBuffer();
    struct timeval startTime, endTime;
    gettimeofday(&startTime, NULL);
    unsigned long long sum = computeSum(indexBuffer, valueBuffer);
    gettimeofday(&endTime, NULL);
    printf("Sum = %llu\n", sum);
    free(indexBuffer);
    free(valueBuffer);
    return ((endTime.tv_sec - startTime.tv_sec) * 1000 * 1000) + (endTime.tv_usec - startTime.tv_usec);
}

int main()
{
    printf("sizeof buffers = %ldMb\n", BUFFER_SIZE * sizeof(unsigned int) / (1024 * 1024));
    unsigned int timeInMicroSeconds = computeTimeInMicroSeconds();
    printf("Time: %u micro-seconds = %.3f seconds\n", timeInMicroSeconds, (double) timeInMicroSeconds / (1000 * 1000));
}
If I launch it, I get the following output:
$ gcc TestPrefetch.c -O3 -o TestPrefetch && ./TestPrefetch
sizeof buffers = 1562Mb
Sum = 439813150288855829
Time: 201172 micro-seconds = 0.201 seconds
Quick and fast!!!
According to my knowledge (I may be wrong), one of the reasons for having such a fast program is that, as I access my two buffers sequentially, data can be prefetched into the CPU cache.
We can make it more complex so that the data can (almost) not be prefetched into the CPU cache. For example, we can just change the createIndexBuffer function to:
unsigned int * createIndexBuffer()
{
    unsigned int * indexBuffer = (unsigned int *) malloc(BUFFER_SIZE * sizeof(unsigned int));
    for (unsigned long i = 0 ; i < BUFFER_SIZE ; i++)
    {
        indexBuffer[i] = rand() % BUFFER_SIZE;
    }
    return (indexBuffer);
}
Let's try the program once again:
$ gcc TestPrefetch.c -O3 -o TestPrefetch && ./TestPrefetch
sizeof buffers = 1562Mb
Sum = 439835307963131237
Time: 3730387 micro-seconds = 3.730 seconds
More than 18 times slower!!!
We now arrive at my problem. Given the new createIndexBuffer function, I would like to speed up the computeSum function using prefetching:
unsigned long long computeSum(unsigned int * indexBuffer, unsigned int * valueBuffer)
{
    unsigned long long sum = 0;
    for (unsigned int i = 0 ; i < BUFFER_SIZE ; i++)
    {
        __builtin_prefetch((char *) &indexBuffer[i + 1], 0, 0);
        unsigned int index = indexBuffer[i];
        sum += valueBuffer[index];
    }
    return (sum);
}
Of course I also have to change my createIndexBuffer so that it allocates a buffer having one more element.
I relaunch my program: no better! Since a prefetch may take longer than one "for" loop iteration, I can prefetch not one element ahead but two elements ahead:
__builtin_prefetch((char *) &indexBuffer[i + 2], 0, 0);
No better! Two loop iterations ahead? No better. Three? I tried distances up to 50 (!!!), but I cannot improve the performance of my computeSum function.
I would like help understanding why.
Thank you very much for your help.
I believe the above code is automatically optimized by the CPU, leaving almost no room for further manual optimization.
1. The main problem is that indexBuffer is accessed sequentially. The hardware prefetcher senses this and prefetches further values automatically, with no need to call prefetch manually. So, during iteration #i, the values indexBuffer[i+1], indexBuffer[i+2], ... are already in cache. (By the way, there is no need to add an artificial element to the end of the array: memory access errors are silently ignored by prefetch instructions.)
What you really need to do is to prefetch valueBuffer instead:
__builtin_prefetch((char *) &valueBuffer[indexBuffer[i + 1]], 0, 0);
2. But adding the above line of code won't help either in such a simple scenario. The cost of accessing memory is hundreds of cycles, while an add instruction is ~1 cycle. Your code already spends 99% of its time in memory accesses. Adding a manual prefetch will make it that one cycle faster, and no better.
A manual prefetch would really work well if your math were much heavier (try it), like an expression with a large number of divisions that cannot be optimized out (20-30 cycles each) or a call to some math function (log, sin).
3. But even this isn't guaranteed to help. The dependency between loop iterations is very weak; it is only via the sum variable. This allows the CPU to execute instructions speculatively: it may start fetching valueBuffer[i+1] concurrently while still executing the math for valueBuffer[i].
A prefetch normally fetches a full cache line, typically 64 bytes. So the random-access example always fetches 64 bytes for a 4-byte int: 64/4 = 16 times the data you actually need, which fits very well with the slowdown by a factor of 18. The code is simply limited by memory throughput, not latency.
Sorry, what I gave you was not the correct version of my code. The correct version is what you said:
__builtin_prefetch((char *) &valueBuffer[indexBuffer[i + prefetchStep]], 0, 0);
However, even with the right version, it is unfortunately no better.
Then I adapted my program to try your suggestion using the sin function.
My adapted program is the following one:
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <sys/time.h>
#include <math.h>

#define BUFFER_SIZE ((unsigned long) 4096 * 50000)

unsigned int randomUint()
{
    int value = rand() % UINT_MAX;
    return value;
}

unsigned int * createValueBuffer()
{
    unsigned int * valueBuffer = (unsigned int *) malloc(BUFFER_SIZE * sizeof(unsigned int));
    for (unsigned long i = 0 ; i < BUFFER_SIZE ; i++)
    {
        valueBuffer[i] = randomUint();
    }
    return (valueBuffer);
}

unsigned int * createIndexBuffer(unsigned short prefetchStep)
{
    unsigned int * indexBuffer = (unsigned int *) malloc((BUFFER_SIZE + prefetchStep) * sizeof(unsigned int));
    for (unsigned long i = 0 ; i < BUFFER_SIZE ; i++)
    {
        indexBuffer[i] = rand() % BUFFER_SIZE;
    }
    return (indexBuffer);
}

double computeSum(unsigned int * indexBuffer, unsigned int * valueBuffer, unsigned short prefetchStep)
{
    double sum = 0;
    for (unsigned int i = 0 ; i < BUFFER_SIZE ; i++)
    {
        __builtin_prefetch((char *) &valueBuffer[indexBuffer[i + prefetchStep]], 0, 0);
        unsigned int index = indexBuffer[i];
        sum += sin(valueBuffer[index]);
    }
    return (sum);
}

unsigned int computeTimeInMicroSeconds(unsigned short prefetchStep)
{
    unsigned int * valueBuffer = createValueBuffer();
    unsigned int * indexBuffer = createIndexBuffer(prefetchStep);
    struct timeval startTime, endTime;
    gettimeofday(&startTime, NULL);
    double sum = computeSum(indexBuffer, valueBuffer, prefetchStep);
    gettimeofday(&endTime, NULL);
    printf("prefetchStep = %d, Sum = %f - ", prefetchStep, sum);
    free(indexBuffer);
    free(valueBuffer);
    return ((endTime.tv_sec - startTime.tv_sec) * 1000 * 1000) + (endTime.tv_usec - startTime.tv_usec);
}

int main()
{
    printf("sizeof buffers = %ldMb\n", BUFFER_SIZE * sizeof(unsigned int) / (1024 * 1024));
    for (unsigned short prefetchStep = 0 ; prefetchStep < 250 ; prefetchStep++)
    {
        unsigned int timeInMicroSeconds = computeTimeInMicroSeconds(prefetchStep);
        printf("Time: %u micro-seconds = %.3f seconds\n", timeInMicroSeconds, (double) timeInMicroSeconds / (1000 * 1000));
    }
}
The output is:
$ gcc TestPrefetch.c -O3 -o TestPrefetch -lm && taskset -c 7 ./TestPrefetch
sizeof buffers = 781Mb
prefetchStep = 0, Sum = -1107.523504 - Time: 20895326 micro-seconds = 20.895 seconds
prefetchStep = 1, Sum = 13456.262424 - Time: 12706720 micro-seconds = 12.707 seconds
prefetchStep = 2, Sum = -20179.289469 - Time: 12136174 micro-seconds = 12.136 seconds
prefetchStep = 3, Sum = 12068.302534 - Time: 11233803 micro-seconds = 11.234 seconds
prefetchStep = 4, Sum = 21071.238160 - Time: 10855348 micro-seconds = 10.855 seconds
prefetchStep = 5, Sum = -22648.280105 - Time: 10517861 micro-seconds = 10.518 seconds
prefetchStep = 6, Sum = 22665.381676 - Time: 9205809 micro-seconds = 9.206 seconds
prefetchStep = 7, Sum = 2461.741268 - Time: 11391088 micro-seconds = 11.391 seconds
...
So here it works better! Honestly, I was almost sure it would not get better, because the cost of the math function is higher compared to the memory access.
If anyone could give me more information about why it is better now, I would appreciate it.
Thank you very much

Finding an efficient algorithm

You are developing a smartphone app. You have a list of potential
customers for your app. Each customer has a budget and will buy the app at
your declared price if and only if the price is less than or equal to the
customer's budget.
You want to fix a price so that the revenue you earn from the app is
maximized. Find this maximum possible revenue.
For instance, suppose you have 4 potential customers and their budgets are
30, 20, 53 and 14. In this case, the maximum revenue you can get is 60: at a price of 20, the three customers with budgets 30, 20 and 53 can afford the app, for revenue 3 * 20 = 60, and no other price does better.
Input format:
Line 1: N, the total number of potential customers.
Lines 2 to N+1: each line has the budget of a potential customer.
Output format:
The output consists of a single integer, the maximum possible revenue you
can earn from selling your app.
Also, the upper bound on N is 5*(10^5) and the upper bound on each customer's budget is 10^8.
This is a problem I'm trying to solve. My strategy was to sort the list of budgets and then multiply each budget by the number of customers who can afford it (its position counted from the top of the sorted sequence), then print the max of the resulting sequence. However, this seems to be quite time-inefficient (at least the way I'm implementing it; I've attached the code for reference). My upper bound on time is 2 seconds. Can anyone help me find a more time-efficient algorithm (or possibly a more efficient way to implement my algorithm)?
Here is my solution:
#include <iostream>
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
using namespace std;

long long max(long long[], long long);
void quickSortIterative(long long[], long long, long long);
long long partition(long long[], long long, long long);
void swap(long long*, long long*);

int main(){
    long long n, k = 1;
    scanf("%lld", &n);
    if (n < 1 || n > 5*((long long)pow(10,5))) {
        exit(0);
    }
    long long budget[n], aux[n];
    for (long long i = 0; i < n; i++) {
        scanf("%lld", &budget[i]);
        if (budget[i] < 1 || budget[i] > (long long)pow(10,8)) {
            exit(0);
        }
    }
    quickSortIterative(budget, 0, n-1);
    for (long long j = n-1; j >= 0; j--) {
        aux[j] = budget[j]*k;
        k++;
    }
    cout << max(aux, n);
    return 0;
}

long long partition(long long arr[], long long l, long long h){
    long long x = arr[h];
    long long i = (l - 1);
    for (long long j = l; j <= h - 1; j++)
    {
        if (arr[j] <= x)
        {
            i++;
            swap(&arr[i], &arr[j]);
        }
    }
    swap(&arr[i + 1], &arr[h]);
    return (i + 1);
}

void swap(long long* a, long long* b){
    long long t = *a;
    *a = *b;
    *b = t;
}

void quickSortIterative(long long arr[], long long l, long long h){
    long long stack[h - l + 1];
    long long top = -1;
    stack[++top] = l;
    stack[++top] = h;
    while (top >= 0){
        h = stack[top--];
        l = stack[top--];
        long long p = partition(arr, l, h);
        if (p - 1 > l){
            stack[++top] = l;
            stack[++top] = p - 1;
        }
        if (p + 1 < h){
            stack[++top] = p + 1;
            stack[++top] = h;
        }
    }
}

long long max(long long arr[], long long length){
    long long max = arr[0];
    for (long long i = 1; i < length; i++){
        if (arr[i] > max){
            max = arr[i];
        }
    }
    return max;
}
Quicksort can take O(n^2) time for certain sequences (often already sorted sequences are bad).
I would recommend you try a sorting approach with guaranteed O(n log n) performance (e.g. heapsort or mergesort). Alternatively, you may well find that using the sort routines in the standard library will give better performance than your version.
You might use qsort in C or std::sort in C++, which is most likely faster than your own code.
Also, your "stack" array will cause you trouble if the difference h - l is large.
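To make the library suggestion concrete, here is a minimal sketch (my addition, not from the original answer) of replacing the hand-rolled quicksort with qsort; the comparator compares instead of subtracting, which avoids overflow on large budgets:
#include <stdlib.h>

/* Comparator for long long values. */
static int cmp_ll(const void *a, const void *b) {
    long long x = *(const long long *)a;
    long long y = *(const long long *)b;
    return (x > y) - (x < y);
}

/* In main, instead of quickSortIterative(budget, 0, n - 1): */
/* qsort(budget, n, sizeof budget[0], cmp_ll); */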
I have used the C++ STL function sort(), whose time complexity is O(n log n). Here you just need to sort the given array and then check from the maximum value down to the minimum value for the solution, which is O(n) after sorting.
My code, which cleared all the test cases:
#include <algorithm>
#include <stdio.h>
#include <cmath>
#include <iostream>
using namespace std;

int main(){
    long long n, a[1000000], max;
    int i, j;
    cin >> n;
    for (i = 0; i < n; i++){
        cin >> a[i];
    }
    sort(a, a + n);
    max = a[n - 1];
    for (i = n - 2; i >= 0; i--){
        //printf("%lld ", a[i]);
        if (max < (a[i] * (n - i)))
            max = a[i] * (n - i);
    }
    cout << max << endl;
    return 0;
}
I don't know if my answer is right or wrong; please point out any mistakes.
#include <stdio.h>

int main(void)
{
    register int i, j;
    long long int n, revenue;
    scanf("%lld", &n);
    long long int a[n];
    for (i = 0; i < n; i++)
        scanf("%lld", &a[i]);
    /* sort budgets in ascending order */
    for (i = 0; i < n; i++)
    {
        for (j = i + 1; j < n; j++)
        {
            if (a[i] > a[j])
            {
                /* swap without a temporary */
                a[i] = a[i] + a[j];
                a[j] = a[i] - a[j];
                a[i] = a[i] - a[j];
            }
        }
    }
    for (i = 0; i < n; i++)
        a[i] = (n - i) * a[i];
    revenue = 0;
    for (i = 0; i < n; i++)
    {
        if (revenue < a[i])
            revenue = a[i];
    }
    printf("%lld\n", revenue);
    return 0;
}
Passed all the test cases:
n = int(input())
r = []
for _ in range(n):
    m = int(input())
    r.append(m)
m = []
r.sort()
l = len(r)
for i in range(l):
    m.append((l - i) * r[i])
print(max(m))
#include <iostream>
#include <bits/stdc++.h>
using namespace std;

int main() {
    long long n;
    std::cin >> n;
    long long a[n];
    for (long long i = 0; i < n; i++)
    {
        std::cin >> a[i];
    }
    sort(a, a + n);
    long long max = LLONG_MIN;  // most negative long long
    for (long long i = 0; i < n; i++)
    {
        if (a[i] * (n - i) > max)
        {
            max = a[i] * (n - i);
        }
    }
    std::cout << max << std::endl;
    return 0;
}
The following solution is in the C programming language.
The approach is:
Input the number of customers.
Input the budgets of the customers.
Sort the budgets.
Assign revenue = 0.
Iterate through the budgets, multiplying each budget by the number of customers who can still afford it.
If the previous revenue < the new revenue, assign the new revenue to the revenue variable.
The code is as follows:
#include <stdio.h>

int main(void) {
    int i, j, noOfCustomer;
    scanf("%d", &noOfCustomer);
    long long int budgetOfCustomer[noOfCustomer], maximumRevenue = 0;
    for (i = 0; i < noOfCustomer; i++)
    {
        scanf("%lld", &budgetOfCustomer[i]);
    }
    /* sort budgets in ascending order */
    for (i = 0; i < noOfCustomer; i++)
    {
        for (j = i + 1; j < noOfCustomer; j++)
        {
            if (budgetOfCustomer[i] > budgetOfCustomer[j])
            {
                budgetOfCustomer[i] = budgetOfCustomer[i] + budgetOfCustomer[j];
                budgetOfCustomer[j] = budgetOfCustomer[i] - budgetOfCustomer[j];
                budgetOfCustomer[i] = budgetOfCustomer[i] - budgetOfCustomer[j];
            }
        }
    }
    for (i = 0; i < noOfCustomer; i++)
    {
        budgetOfCustomer[i] = budgetOfCustomer[i] * (noOfCustomer - i);
    }
    for (i = 0; i < noOfCustomer; i++)
    {
        if (maximumRevenue < budgetOfCustomer[i])
            maximumRevenue = budgetOfCustomer[i];
    }
    printf("%lld\n", maximumRevenue);
    return 0;
}

Why are the timings for the vectorized reduction for a simple Riemann sum-integral on Xeon Phi so bad?

I am new to the Xeon Phi and so I am going through the manuals trying to understand how
to improve performance on the Phi using the vector registers.
Consider the short code at the end of this question which calculates the area under the curve 4/(1+x^2) on [0,1] using a Riemann sum. The analytic answer is pi = 3.14159....
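For reference, the exact value follows from the antiderivative of the integrand:

$\int_0^1 \frac{4}{1+x^2}\,dx = 4\arctan x\,\Big|_0^1 = 4\cdot\frac{\pi}{4} = \pi.$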
The code basically consists of two nearly identical chunks of code which use OpenMP
to calculate the answer using 4 threads. The only difference is that in the second
chunk I am using the vectorized function __sec_reduce_add() to compute the Riemann
sum of the sub-domain of [0,1] given to the thread.
The timing for the first chunk of the code is 0.0866439 s, and for the second (vectorized) chunk it is 0.0868771 s.
Why did these both yield nearly the same timings? I would have thought that using the vector registers would have significantly improved the performance.
I compiled this with the flags icc -mmic -vec-report3 -openmp.
[Note: I have put a for loop with the rpt variable over the two sections, because rpt=0 and rpt=1 are "warm-up" loops and so will have somewhat higher timings. I have given the timings of the two sections at rpt=3]
#include <iostream>
#include <omp.h>
using namespace std;

int main (void)
{
    int num_steps = 2e8 ;
    double dx = 1.0/num_steps ;
    double x = 0. ;
    double* fn = new double[num_steps];

    // Initialize an array containing function values
    for(int i=0 ; i<num_steps ; ++i )
    {
        fn[i] = 4.0*dx/(1.0 + x*x);
        x += dx;
    }

    for(size_t rpt=0 ; rpt<4 ; ++rpt)
    {
        double start = omp_get_wtime();
        double parallel_sum = 0.;
        #pragma omp parallel num_threads(4)
        {
            int threadIdx = omp_get_thread_num();
            int begin = threadIdx * num_steps/4 ; // integer index of left-end point of sub-interval
            int end = begin + num_steps/4 ;       // integer index of right-end point of sub-interval
            double dx_local = dx ;
            double temp = 0 ;
            double x = begin*dx ;
            for (int i = begin; i < end; ++i)
            {
                temp += fn[i];
            }
            #pragma omp atomic
            parallel_sum += temp;
        }
        double end = omp_get_wtime();
        std::cout << "\nTime taken for the parallel computation: " << end-start << " seconds";

        //%%%%%%%%%%%%%%%%%%%%%%%%%

        start = omp_get_wtime();
        double parallel_vector_sum = 0.;
        #pragma omp parallel num_threads(4)
        {
            int threadIdx = omp_get_thread_num();
            int begin = threadIdx * num_steps/4 ; // integer index of left-end point of sub-interval
            int end = begin + num_steps/4 ;       // integer index of right-end point of sub-interval
            double dx_local = dx ;
            double temp = 0 ;
            double x = begin*dx ;
            // the array section fn[b:len] covers fn[b] .. fn[b+len-1],
            // so the length here must be end-begin
            temp = __sec_reduce_add( fn[begin:end-begin] );
            #pragma omp atomic
            parallel_vector_sum += temp;
        }
        end = omp_get_wtime();
        std::cout << "Time taken for the parallel vector computation: " << end-start << " seconds" ;
    } // end for rpt

    return 0;
}

Examples of strict aliasing of pointers in GCC C99, no performance differences

I'm trying to understand the impact of strict aliasing on performance in C99. My goal is to optimize a vector dot product, which takes up a large amount of time in my program (profiled it!). I thought that aliasing could be the problem, but the following code doesn't show any substantial difference between the standard approach and the strict aliasing version, even with vectors of size 100 million. I've also tried to use local variables to avoid aliasing, with similar results.
What's happening?
I'm using gcc-4.7 on OSX 10.7.4. Results are in microseconds.
$ /usr/local/bin/gcc-4.7 -fstrict-aliasing -Wall -std=c99 -O3 -o restrict restrict.c
$ ./restrict
sum: 100000000 69542
sum2: 100000000 70432
sum3: 100000000 70372
sum4: 100000000 69891
$ /usr/local/bin/gcc-4.7 -Wall -std=c99 -O0 -fno-strict-aliasing -o restrict restrict.c
$ ./restrict
sum: 100000000 258487
sum2: 100000000 261349
sum3: 100000000 258829
sum4: 100000000 258129
restrict.c (note this code will need several hundred MB RAM):
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <sys/time.h>
#include <unistd.h>

/* original */
long sum(int *x, int *y, int n)
{
    long i, s = 0;
    for(i = 0 ; i < n ; i++)
        s += x[i] * y[i];
    return s;
}

/* restrict */
long sum2(int *restrict x, int *restrict y, int n)
{
    long i, s = 0;
    for(i = 0 ; i < n ; i++)
        s += x[i] * y[i];
    return s;
}

/* local restrict */
long sum3(int *x, int *y, int n)
{
    int *restrict xr = x;
    int *restrict yr = y;
    long i, s = 0;
    for(i = 0 ; i < n ; i++)
        s += xr[i] * yr[i];
    return s;
}

/* use local variables */
long sum4(int *x, int *y, int n)
{
    int xr, yr;
    long i, s = 0;
    for(i = 0 ; i < n ; i++)
    {
        xr = x[i];
        yr = y[i];
        s += xr * yr;
    }
    return s;
}

int main(void)
{
    struct timeval tp1, tp2;
    struct timezone tzp;
    long i, n = 1e8L, s;
    int *x = malloc(sizeof(int) * n);
    int *y = malloc(sizeof(int) * n);
    long elapsed1;

    for(i = 0 ; i < n ; i++)
        x[i] = y[i] = 1;

    gettimeofday(&tp1, &tzp);
    s = sum(x, y, n);
    gettimeofday(&tp2, &tzp);
    elapsed1 = (tp2.tv_sec - tp1.tv_sec) * 1e6
             + (tp2.tv_usec - tp1.tv_usec);
    printf("sum:\t%ld\t%ld\n", s, elapsed1);

    gettimeofday(&tp1, &tzp);
    s = sum2(x, y, n);
    gettimeofday(&tp2, &tzp);
    elapsed1 = (tp2.tv_sec - tp1.tv_sec) * 1e6
             + (tp2.tv_usec - tp1.tv_usec);
    printf("sum2:\t%ld\t%ld\n", s, elapsed1);

    gettimeofday(&tp1, &tzp);
    s = sum3(x, y, n);
    gettimeofday(&tp2, &tzp);
    elapsed1 = (tp2.tv_sec - tp1.tv_sec) * 1e6
             + (tp2.tv_usec - tp1.tv_usec);
    printf("sum3:\t%ld\t%ld\n", s, elapsed1);

    gettimeofday(&tp1, &tzp);
    s = sum4(x, y, n);
    gettimeofday(&tp2, &tzp);
    elapsed1 = (tp2.tv_sec - tp1.tv_sec) * 1e6
             + (tp2.tv_usec - tp1.tv_usec);
    printf("sum4:\t%ld\t%ld\n", s, elapsed1);

    return EXIT_SUCCESS;
}
Off the cuff:
with no strict aliasing rules, the compiler might simply generate optimized code that does subtly different things than intended.
It is not a given that disabling strict aliasing rules leads to faster code.
If it does, it's also not a given that the optimized code actually shows different results. This depends a lot on the actual data access patterns, and often even on the processor/cache architecture.
Regarding your example code, I'd say that aliasing is irrelevant (for emitted code, at least) since there is never any write access to the array elements inside the sumXXX functions.
(You might get slightly better performance, or the opposite, if you pass the same vector twice. There might be a boon from a hot cache and a smaller cache footprint; there may be a penalty from redundant loads putting the prefetch predictor off-track. As always: use a profiler.)
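To make the point about write access concrete, here is a minimal sketch of my own (not from the original post) of a loop where restrict can actually matter: it writes through one pointer while reading through another, so without restrict the compiler would have to assume that out may alias in and re-load in[i] after every store.
#include <stddef.h>

/* Hypothetical example, not from the question's code. With restrict the
 * compiler may treat the reads and writes as independent, which enables
 * vectorization without runtime alias checks. */
void scale(double *restrict out, const double *restrict in, double k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = k * in[i];
}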

Fast Modulo 511 and 127

Is there a way to make modulo by 511 (and 127) faster than using the "%" operator?
int c = 758 % 511;
int d = 423 % 127;
Here is a way to do fast modulo by 511, assuming that x is at most 32767. It's about twice as fast as x%511. It does the modulo in five operations: two multiplications, an addition, a subtraction, and a shift.
inline int fast_mod_511(int x) {
    int y = (513*x + 64) >> 18;
    return x - 511*y;
}
Here is the theory of how I arrived at this. I posted the code I used to test it at the end.
Let's consider
y = x/511 = x/(512-1) = x/512 * 1/(1-1/512).
Let's define z = 512; then
y = x/z * 1/(1-1/z).
Using the Taylor (geometric series) expansion,
y = x/z * (1 + 1/z + 1/z^2 + 1/z^3 + ...).
Now if we know that x has a limited range, we can truncate the expansion. Let's assume x is always less than 2^15 = 32768. Then we can keep only the first two terms, y ≈ x/z * (1 + 1/z) = (z+1)*x/z^2, and write
512*512*y = (1+512)*x = 513*x.
After looking at which digits are significant, we arrive at
y = (513*x + 64) >> 18 // 512^2 = 2^18
So we can compute x/511 (assuming x is less than 32768) in three steps:
multiply,
add,
shift.
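The identity can be checked exhaustively over the stated range before profiling. The question also asks about 127; the same construction with z = 128 suggests y = (129*x + 64) >> 14, which by my own derivation (not from the original answer) holds for 0 <= x <= 8254. A minimal sketch verifying both:
#include <assert.h>
#include <stdio.h>

static inline int fast_mod_511(int x) {
    int y = (513*x + 64) >> 18;
    return x - 511*y;
}

/* Hypothetical 127 analogue, my own derivation with z = 128. */
static inline int fast_mod_127(int x) {
    int y = (129*x + 64) >> 14;
    return x - 127*y;
}

int main(void) {
    /* Exhaustively compare both against the % operator. */
    for (int x = 0; x < 32768; x++)
        assert(fast_mod_511(x) == x % 511);
    for (int x = 0; x <= 8254; x++)
        assert(fast_mod_127(x) == x % 127);
    puts("both identities hold over their stated ranges");
    return 0;
}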
Here is the code I used to profile this, in MSVC 2013 64-bit release mode on an Ivy Bridge core.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

inline int fast_mod_511(int x) {
    int y = (513*x + 64) >> 18;
    return x - 511*y;
}

int main() {
    unsigned int i;
    volatile unsigned int r;
    double dtime;

    dtime = omp_get_wtime();
    for(i=0; i<100000; i++) {
        for(int j=0; j<32768; j++) {
            r = j % 511;
        }
    }
    dtime = omp_get_wtime() - dtime;
    printf("time %f\n", dtime);

    dtime = omp_get_wtime();
    for(i=0; i<100000; i++) {
        for(int j=0; j<32768; j++) {
            r = fast_mod_511(j);
        }
    }
    dtime = omp_get_wtime() - dtime;
    printf("time %f\n", dtime);
}
You can use a lookup table with the solutions pre-stored. If you create an array of a million integers, looking up is about twice as fast as actually computing the modulo in my C# app.
// fill an array
var mod511 = new int[1000000];
for (int x = 0; x < 1000000; x++) mod511[x] = x % 511;
and instead of using
c = 758 % 511;
you use
c = mod511[758];
This will cost you (possibly a lot of) memory and will obviously not work for numbers beyond the table's range, but it is faster.
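In C, the same idea might look like the following sketch (my translation, under the same assumptions; the table bound of 1000000 is the one used above):
#include <stdio.h>

#define TABLE_SIZE 1000000

static int mod511[TABLE_SIZE];

int main(void) {
    /* fill the table once */
    for (int x = 0; x < TABLE_SIZE; x++)
        mod511[x] = x % 511;
    /* then each modulo becomes a single load */
    int c = mod511[758];
    printf("%d\n", c);  /* prints 247 */
    return 0;
}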
If you have to repeat those two modulo operations over a large amount of data, and your CPU supports SIMD (for example Intel's SSE/AVX/AVX2), then you can vectorize them, i.e., perform the operation on many data elements in parallel. You can do this with intrinsics or inline assembly. Yes, the solution will be platform-specific, but maybe that is fine...
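As an illustration of the intrinsics route, here is a minimal sketch of my own (not from the original answer) that applies the fast_mod_511 trick above to four ints at once using SSE4.1; it assumes every input lies in [0, 32767]:
#include <smmintrin.h>  /* SSE4.1, for _mm_mullo_epi32 */

/* Lane-wise x % 511 for four ints, valid for 0 <= x <= 32767. */
static inline __m128i mod511_epi32(__m128i x) {
    const __m128i c513 = _mm_set1_epi32(513);
    const __m128i c511 = _mm_set1_epi32(511);
    const __m128i c64  = _mm_set1_epi32(64);
    /* y = (513*x + 64) >> 18 */
    __m128i y = _mm_srai_epi32(_mm_add_epi32(_mm_mullo_epi32(x, c513), c64), 18);
    /* x - 511*y */
    return _mm_sub_epi32(x, _mm_mullo_epi32(y, c511));
}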
