How does std::this_thread::yield() work? - C++11

std::this_thread::yield() gives up the current CPU time slice to other threads/processes.
On Windows, each CPU time slice is supposed to last 2~10 ms.
That means the gap between iterations should be at least 2~10 ms when I call std::this_thread::yield() inside a busy loop.
I created a test program:
```cpp
#include <windows.h>
#include <iostream>
#include <thread>
#include <cstdlib>
using namespace std;

void thread_function_2() {
    double dt;
    LARGE_INTEGER nFreq;
    QueryPerformanceFrequency(&nFreq);
    LARGE_INTEGER tstart;
    LARGE_INTEGER tend;
    QueryPerformanceCounter(&tstart);
    for (int a = 0; a < 100; a++)
    {
        std::this_thread::yield();
    }
    QueryPerformanceCounter(&tend);
    cout << "-----------------hit 100 times yield cost time-----------------" << endl;
    dt = (tend.QuadPart - tstart.QuadPart) / (double)nFreq.QuadPart;
    cout << "Total time :" << dt * 1000000 << "us" << endl;
}

int main()
{
    std::thread thread_2(thread_function_2);
    system("pause");
    thread_2.join(); // destroying a joinable std::thread would call std::terminate
    return 0;
}
```
The output is
-----------------hit 100 times yield cost time-----------------
Total time :9.9us
That means 100 calls to std::this_thread::yield() cost 9.9 µs in total.
My understanding was that each call to std::this_thread::yield() should give up the rest of the CPU time slice, costing at least 2 ms, so the 100-iteration loop should cost at least 200 ms.
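For comparison, here is a minimal portable sketch (standard C++11 only; nothing from the original program is assumed) that times 100 yields against 100 one-millisecond sleeps. yield() is only a hint to the scheduler: if no other thread is ready to run on that core, the call can return almost immediately, whereas sleep_for() really blocks the thread:

```cpp
#include <chrono>
#include <iostream>
#include <thread>

int main()
{
    using clock = std::chrono::steady_clock;

    // 100 yields: each call merely offers the rest of the time slice;
    // if no other thread is ready to run, it returns almost immediately.
    auto t0 = clock::now();
    for (int i = 0; i < 100; ++i)
        std::this_thread::yield();
    auto t1 = clock::now();

    // 100 one-millisecond sleeps: each call really blocks the thread.
    for (int i = 0; i < 100; ++i)
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    auto t2 = clock::now();

    std::cout << "100 yields: "
              << std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count()
              << " us\n100 sleeps: "
              << std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count()
              << " us\n";
}
```

On a lightly loaded machine the first number is typically a few microseconds, while the second is at least 100 ms by construction.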

Async doesn't work for long vectors

I am doing some parallel programming with std::async. I have an integrator, and in a test program I wanted to see whether dividing a vector into 4 subvectors actually takes one fourth of the time to complete the task.
I had an initial issue with the time measurement, now solved, since steady_clock measures real time and not CPU time.
I tried the code with different vector lengths. For short lengths (<10e5 elements) the direct integration is faster, which is expected, since the .get() calls and the final sum take their time.
For intermediate lengths (about 1e8 elements) the integration took the expected time: 1 s for the direct run and 0.26 s for the split run.
For long vectors (10e9 elements or more) the second integration takes much more time than the first: more than 3 s, against a similar or lower time for the direct run.
Why? What is it that makes the divide-and-conquer routine slower?
A couple of additional notes: I pass the vectors by reference, so that cannot be the issue, and keep in mind that this is test code, so the way the subvectors are created is not the point of the question.
```cpp
#include <iostream>
#include <vector>
#include <thread>
#include <future>
#include <ctime>
#include <chrono>

using namespace std;
using namespace chrono;

typedef steady_clock::time_point tt;

double integral(const std::vector<double>& v, double dx) // Simpson 1/3
{
    int n = v.size();
    double in = 0.;
    if (n % 2 == 1) { in += v[n-1] * v[n-1]; n--; }
    in += (v[0] * v[0]) + (v[n-1] * v[n-1]); // += so the odd-element term above is not overwritten
    for (int i = 1; i < n / 2; i++)
        in += 2. * v[2*i] + 4. * v[2*i+1];
    return in * dx / 3.;
}

int main()
{
    double h = 0.001;
    vector<double> v1(100000, h); // a vector; the content is not important
    // subvectors
    vector<double> sv1(v1.begin(), v1.begin() + v1.size()/4),
                   sv2(v1.begin() + v1.size()/4 + 1, v1.begin() + 2*v1.size()/4),
                   sv3(v1.begin() + 2*v1.size()/4 + 1, v1.begin() + 3*v1.size()/4 + 1),
                   sv4(v1.begin() + 3*v1.size()/4 + 1, v1.end());
    double a, b;

    cout << "f1" << endl;
    tt bt1 = chrono::steady_clock::now();
    // complete integration: should take time t
    a = integral(v1, h);
    tt et1 = chrono::steady_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(et1 - bt1);
    cout << time_span.count() << endl;

    future<double> f1, f2, f3, f4;
    cout << "f2" << endl;
    tt bt2 = chrono::steady_clock::now();
    // four integrations: should take time t/4
    f1 = async(launch::async, integral, ref(sv1), h);
    f2 = async(launch::async, integral, ref(sv2), h);
    f3 = async(launch::async, integral, ref(sv3), h);
    f4 = async(launch::async, integral, ref(sv4), h);
    b = f1.get() + f2.get() + f3.get() + f4.get();
    tt et2 = chrono::steady_clock::now();
    duration<double> time_span2 = duration_cast<duration<double>>(et2 - bt2);
    cout << time_span2.count() << endl;

    cout << a << " " << b << endl;
    return 0;
}
```
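One way to narrow down where the extra time goes is to time the launch phase and the .get() phase separately: if the launch phase grows with the input, the cost is in setting up the tasks; if .get() dominates, the four workers themselves are slow, for example because they compete for the same memory bandwidth. A minimal sketch, using a stand-in body for integral() (any function with the same signature works for this experiment):

```cpp
#include <chrono>
#include <future>
#include <iostream>
#include <vector>

// Stand-in body for the question's integral(); only the signature matters here.
double integral(const std::vector<double>& v, double dx)
{
    double s = 0.;
    for (double x : v) s += x * x;
    return s * dx;
}

int main()
{
    using namespace std::chrono;
    std::vector<double> sv(25000000, 0.001); // one quarter of a large input

    auto t0 = steady_clock::now();
    auto f1 = std::async(std::launch::async, integral, std::cref(sv), 0.001);
    auto f2 = std::async(std::launch::async, integral, std::cref(sv), 0.001);
    auto f3 = std::async(std::launch::async, integral, std::cref(sv), 0.001);
    auto f4 = std::async(std::launch::async, integral, std::cref(sv), 0.001);
    auto t1 = steady_clock::now();

    double b = f1.get() + f2.get() + f3.get() + f4.get();
    auto t2 = steady_clock::now();

    std::cout << "launch: " << duration<double>(t1 - t0).count() << " s, "
              << "get: "    << duration<double>(t2 - t1).count() << " s, "
              << "sum = "   << b << '\n';
}
```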

Intel VTune Results Understanding - Naive Questions

The application I want to speed up performs element-wise processing of a large array (about 1e8 elements).
The processing procedure for each element is very simple, and I suspect that the bottleneck could be DRAM bandwidth rather than the CPU.
So I decided to study the single-threaded version first.
The system is: Windows 10 64-bit, 32 GB RAM, Intel Core i7-3770S (Ivy Bridge, 3.10 GHz, 4 cores), Hyper-Threading enabled
Concurrency analysis
Elapsed Time: 34.425s
    CPU Time: 14.908s
        Effective Time: 14.908s
            Idle: 0.005s
            Poor: 14.902s
            Ok: 0s
            Ideal: 0s
            Over: 0s
        Spin Time: 0s
        Overhead Time: 0s
    Wait Time: 0.000s
        Idle: 0.000s
        Poor: 0s
        Ok: 0s
        Ideal: 0s
        Over: 0s
Total Thread Count: 2
Paused Time: 18.767s
Memory Access Analysis
Memory Access Analysis reports different CPU times for three consecutive runs on the same amount of data.
The actual execution time was about 23 seconds, as the Concurrency Analysis indicates.
Run 1:
    Elapsed Time: 33.526s
        CPU Time: 5.740s
        Memory Bound: 38.3%
            L1 Bound: 10.4%
            L2 Bound: 0.0%
            L3 Bound: 0.1%
            DRAM Bound: 0.8%
                Memory Bandwidth: 36.1%
                Memory Latency: 60.4%
    Loads: 12,912,960,000
    Stores: 7,720,800,000
    LLC Miss Count: 420,000
    Average Latency (cycles): 15
    Total Thread Count: 4
    Paused Time: 18.081s
Run 2:
    Elapsed Time: 33.011s
        CPU Time: 4.501s
        Memory Bound: 36.9%
            L1 Bound: 10.6%
            L2 Bound: 0.0%
            L3 Bound: 0.2%
            DRAM Bound: 0.6%
                Memory Bandwidth: 36.5%
                Memory Latency: 62.7%
    Loads: 9,836,100,000
    Stores: 5,876,400,000
    LLC Miss Count: 180,000
    Average Latency (cycles): 15
    Total Thread Count: 4
    Paused Time: 17.913s
Run 3:
    Elapsed Time: 33.738s
        CPU Time: 5.999s
        Memory Bound: 38.5%
            L1 Bound: 10.8%
            L2 Bound: 0.0%
            L3 Bound: 0.1%
            DRAM Bound: 0.9%
                Memory Bandwidth: 57.8%
                Memory Latency: 37.3%
    Loads: 13,592,760,000
    Stores: 8,125,200,000
    LLC Miss Count: 660,000
    Average Latency (cycles): 15
    Total Thread Count: 4
    Paused Time: 18.228s
As far as I understand the Summary page, the situation is not very good.
The paper Finding your Memory Access performance bottlenecks says that the reason would be so-called false sharing. But I do not use multithreading; all processing is performed by just one thread.
On the other hand, according to the Memory Access Analysis/Platform page, DRAM bandwidth is not the bottleneck.
So the questions are:
Why do the CPU Time metric values differ between the Concurrency Analysis and the Memory Access Analysis?
What is the reason for the bad memory metric values, especially for L1 Bound?
The main loop is a lambda function, where
tasklets: a std::vector of simple structures that contain the coefficients for the data processing
points: the data itself, an Eigen::Matrix
projections: an Eigen::Matrix, the array the processing results are written into
The code is:
```cpp
#include <iostream>
#include <sstream>
#include <future>
#include <random>
#include <Eigen/Dense>
#include <ittnotify.h>

using namespace std;

using Vector3 = Eigen::Matrix<float, 3, 1>;
using Matrix3X = Eigen::Matrix<float, 3, Eigen::Dynamic>;

uniform_real_distribution<float> rnd(0.1f, 100.f);
default_random_engine gen;

class Tasklet {
public:
    Tasklet(int p1, int p2)
        : p1Id(p1), p2Id(p2), Loc0(p1)
    {
        RestDistance = rnd(gen);
        Weight_2 = rnd(gen);
    }

    __forceinline void solve(const Matrix3X& q, Matrix3X& p)
    {
        Vector3 q1 = q.col(p1Id);
        Vector3 q2 = q.col(p2Id);
        // Note: this loop never executes (the condition i < 0 is false from the start).
        for (int i = 0; i < 0; ++i) {
            Vector3 delta = q2 - q1;
            float norm = delta.blueNorm() * delta.hypotNorm();
        }
        Vector3 deltaQ = q2 - q1;
        float dist = deltaQ.norm();
        Vector3 deltaUnitVector = deltaQ / dist;
        p.col(Loc0) = deltaUnitVector * RestDistance * Weight_2;
    }

    int p1Id;
    int p2Id;
    int Loc0;
    float RestDistance;
    float Weight_2;
};

typedef vector<Tasklet*> TaskList;

void
runTest(const Matrix3X& points, Matrix3X& projections, TaskList& tasklets)
{
    size_t num = tasklets.size();
    for (size_t i = 0; i < num; ++i) {
        Tasklet* t = tasklets[i];
        t->solve(points, projections);
    }
}

void
prepareData(Matrix3X& points, Matrix3X& projections, int numPoints, TaskList& tasklets)
{
    points.resize(3, numPoints);
    projections.resize(3, numPoints);
    points.setRandom();
    /*
    for (int i = 0; i < numPoints; ++i) {
        points.col(i) = Vector3(1, 0, 0);
    }
    */
    tasklets.reserve(numPoints - 1);
    for (int i = 1; i < numPoints; ++i) {
        tasklets.push_back(new Tasklet(i - 1, i));
    }
}

int
main(int argc, const char** argv)
{
    // Pause VTune data collection
    __itt_pause();

    cout << "Usage: <exefile> <number of points (in thousands)> <#runs for averaging>" << endl;

    int numPoints = 150 * 1000;
    int numRuns = 1;
    int argNo = 1;
    if (argc > argNo) {
        istringstream in(argv[argNo]);
        int i;
        in >> i;
        if (in) {
            numPoints = i * 1000;
        }
    }
    ++argNo;
    if (argc > argNo) {
        istringstream in(argv[argNo]);
        int i;
        in >> i;
        if (in) {
            numRuns = i;
        }
    }

    cout
        << "Running test" << endl
        << "\t NumPoints (thousands): " << numPoints / 1000. << endl
        << "\t # of runs for averaging: " << numRuns << endl;

    Matrix3X q, projections;
    TaskList tasklets;

    cout << "Preparing test data" << endl;
    prepareData(q, projections, numPoints, tasklets);

    cout << "Running test" << endl;
    // Resume VTune data collection
    __itt_resume();
    for (int r = 0; r < numRuns; ++r) {
        runTest(q, projections, tasklets);
    }
    // Pause VTune data collection
    __itt_pause();

    for (auto* t : tasklets) {
        delete t;
    }
    return 0;
}
```
Thank you.
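One experiment that may bear on the L1 Bound question: TaskList holds pointers to Tasklets that were each allocated with their own new, so the loop in runTest() chases a pointer for every element. A hypothetical variation (reusing the Tasklet and Matrix3X definitions from the code above) that stores the tasklets by value makes the traversal contiguous and friendlier to the hardware prefetcher:

```cpp
// Hypothetical variation on the code above: store the tasklets contiguously
// by value instead of as individually heap-allocated pointers.
typedef std::vector<Tasklet> TaskListByValue;

void runTestByValue(const Matrix3X& points, Matrix3X& projections,
                    TaskListByValue& tasklets)
{
    // Contiguous storage: the prefetcher can stream the Tasklet fields
    // instead of following one pointer per element.
    for (Tasklet& t : tasklets)
        t.solve(points, projections);
}
```

Whether this actually moves the L1 Bound number is exactly the kind of thing a second Memory Access run could confirm.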

C++ - Function is completely skipped if an internal variable exceeds ~60,000

I wrote the following for a class, but came across some strange behavior while testing it. arrayProcedure is meant to do things with an array based on the two "tweaks" at the top of the function (arrSize and start). For the assignment, arrSize must be 10,000 and start 100. Just for kicks, I decided to see what happens if I increase them, and for some reason, if arrSize exceeds around 60,000 (I haven't found the exact limit), the program immediately crashes with a stack overflow when run under a debugger:
Unhandled exception at 0x008F6977 in TMA3Question1.exe: 0xC00000FD: Stack overflow (parameters: 0x00000000, 0x00A32000).
If I just run it without a debugger, I don't get any helpful errors; Windows hangs for a fraction of a second, then tells me TMA3Question1.exe has stopped working.
I decided to play around with debugging, but that didn't shed any light. I placed breakpoints above and below the call to arrayProcedure, as well as peppered inside of it. When arrSize doesn't exceed 60,000 it runs fine: it pauses before calling arrayProcedure, properly waits at all the breakpoints inside of it, then pauses on the breakpoint underneath the call.
If I raise arrSize, however, the breakpoint before the call is hit, but it appears as though execution never even steps into arrayProcedure; I immediately get a stack overflow without pausing at any of the internal breakpoints.
The only thing I can think of is that the resulting arrays exceed my computer's memory, but that doesn't seem likely for a couple of reasons:
It should only use just under a megabyte:
sizeof(double) = 8 bytes
8 * 60000 = 480000 bytes per array
480000 * 2 = 960000 bytes for both arrays
As far as I know, arrays aren't immediately constructed when a function is entered; they're allocated at their definition. I placed several breakpoints before the arrays are even declared, and they are never reached.
Any light that you could shed on this would be appreciated.
The code:
```cpp
#include <iostream>
#include <ctime>
#include <climits> // for INT_MAX

// CLOCKS_PER_SEC is a macro supplied by <ctime>
double msBetween(clock_t startTime, clock_t endTime) {
    // Parenthesize the subtraction and convert seconds to milliseconds.
    return (endTime - startTime) * 1000.0 / CLOCKS_PER_SEC;
}

void initArr(double arr[], int start, int length, int step) {
    for (int i = 0, j = start; i < length; i++, j += step) {
        arr[i] = j;
    }
}

// The function we're going to inline in the next question
void helper(double a1, double a2) {
    std::cout << a1 << " * " << a2 << " = " << a1 * a2 << std::endl;
}

void arrayProcedure() {
    const int arrSize = 70000;
    const int start = 1000000;

    std::cout << "Checking..." << std::endl;
    if (arrSize > INT_MAX) { // note: always false, since arrSize is an int
        std::cout << "Given arrSize is too high and exceeds the INT_MAX of: " << INT_MAX << std::endl;
        return;
    }

    double arr1[arrSize];
    double arr2[arrSize];
    initArr(arr1, start, arrSize, 1);
    initArr(arr2, arrSize + start - 1, arrSize, -1);
    for (int i = 0; i < arrSize; i++) {
        helper(arr1[i], arr2[i]);
    }
}

int main(int argc, char* argv[]) {
    using namespace std;
    const clock_t startTime = clock();
    arrayProcedure();
    clock_t endTime = clock();
    cout << endTime << endl;

    double elapsedTime = msBetween(startTime, endTime);
    cout << "\n\n" << elapsedTime << " milliseconds. ("
         << elapsedTime / 60000 << " minutes)\n";
}
```
The default stack size is 1 MB with Visual Studio.
https://msdn.microsoft.com/en-us/library/tdkhxaks.aspx
You can increase the stack size or use the new operator.
```cpp
double *arr1 = new double[arrSize];
double *arr2 = new double[arrSize];
...
delete [] arr1;
delete [] arr2;
```
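Equivalently, a std::vector keeps the data on the heap and releases it automatically, so there is no stack limit to hit and no delete to forget. A sketch reusing initArr() and helper() from the question:

```cpp
#include <vector>

void arrayProcedure() {
    const int arrSize = 70000;
    const int start = 1000000;

    // Heap-allocated; freed automatically when the vectors go out of scope.
    std::vector<double> arr1(arrSize);
    std::vector<double> arr2(arrSize);
    initArr(arr1.data(), start, arrSize, 1);
    initArr(arr2.data(), arrSize + start - 1, arrSize, -1);
    for (int i = 0; i < arrSize; i++) {
        helper(arr1[i], arr2[i]);
    }
}
```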

Does FFTW output depend on the size of the input?

In the last week I have been programming some 2-dimensional convolutions with FFTW, by transforming both signals to the frequency domain, multiplying them, and then transforming back.
Surprisingly, I am getting the correct result only when the input size is less than a fixed number!
I am posting some working code, in which I take simple constant initial matrices: value 5 for the input and 1 for the filter, in the spatial domain. This way, the result of convolving them (after normalization) should be a matrix holding the average of the first matrix's values, i.e., 5, since it is constant. This works when the sizes go up to h=215, w=215; if I set h=216, w=216, or greater, the output gets corrupted! I would really appreciate some clues about where I could be making a mistake. Thank you very much!
```cpp
#include <iostream>
#include <fftw3.h>

int main(int argc, char* argv[]) {
    int h = 215, w = 215;

    // Input and filter are declared and initialized here
    float *in = (float*) fftwf_malloc(sizeof(float) * w * h);
    float *identity = (float*) fftwf_malloc(sizeof(float) * w * h);
    for (int i = 0; i < w * h; i++) {
        in[i] = 5;
        identity[i] = 1;
    }

    // Declare two forward plans and one backward
    fftwf_plan plan1, plan2, plan3;

    // Allocate for the complex output of both transforms
    fftwf_complex *inTrans = (fftwf_complex*) fftwf_malloc(sizeof(fftwf_complex) * h * (w/2 + 1));
    fftwf_complex *identityTrans = (fftwf_complex*) fftwf_malloc(sizeof(fftwf_complex) * h * (w/2 + 1));

    // Initialize the forward plans
    plan1 = fftwf_plan_dft_r2c_2d(h, w, in, inTrans, FFTW_ESTIMATE);
    plan2 = fftwf_plan_dft_r2c_2d(h, w, identity, identityTrans, FFTW_ESTIMATE);

    // Execute them
    fftwf_execute(plan1);
    fftwf_execute(plan2);

    // Multiply in the frequency domain. Theoretically there is no need to multiply
    // the imaginary parts: since the signals are real and symmetric their transforms
    // are also real and identityTrans[i][1] = 0, but I leave this here for a more
    // generic implementation.
    for (int i = 0; i < (w/2 + 1) * h; i++) {
        float re = inTrans[i][0]; // save the old real part: both lines below need it
        inTrans[i][0] = re * identityTrans[i][0] - inTrans[i][1] * identityTrans[i][1];
        inTrans[i][1] = re * identityTrans[i][1] + inTrans[i][1] * identityTrans[i][0];
    }

    // Execute the inverse transform, storing the result in identity, where the filter was.
    plan3 = fftwf_plan_dft_c2r_2d(h, w, inTrans, identity, FFTW_ESTIMATE);
    fftwf_execute(plan3);

    // Output the first results of convolution(in, identity) to see if they equal the average of in.
    for (int i = 0; i < h/h + 4; i++) {      // h/h + 4 == 5: print a 5x5 corner
        for (int j = 0; j < w/w + 4; j++) {
            std::cout << "After convolution, component (" << i << "," << j << ") is "
                      << identity[j + i*w] / (w*h*w*h) << std::endl;
        }
    }
    std::cout << std::endl;

    // Compute the average of the data
    float sum = 0.0;
    for (int i = 0; i < w*h; i++)
        sum += in[i];
    std::cout << "Mean of input was " << (float)sum / (w*h) << std::endl;
    std::cout << std::endl;

    fftwf_destroy_plan(plan1);
    fftwf_destroy_plan(plan2);
    fftwf_destroy_plan(plan3);
    fftwf_free(in);
    fftwf_free(identity);
    fftwf_free(inTrans);
    fftwf_free(identityTrans);
    return 0;
}
```
Your problem has nothing to do with FFTW! It comes from this line:
```cpp
std::cout << "After convolution, component (" << i << "," << j << ") is " << identity[j + i*w] / (w*h*w*h) << std::endl;
```
If w=216 and h=216, then w*h*w*h = 2,176,782,336. The upper limit for a signed 32-bit integer is 2,147,483,647. You are facing an overflow...
The solution is to cast the denominator to float:
```cpp
std::cout << "After convolution, component (" << i << "," << j << ") is " << identity[j + i*w] / (((float)w)*h*w*h) << std::endl;
```
The next problem you are going to face is this one:
```cpp
float sum = 0.0;
for (int i = 0; i < w*h; i++)
    sum += in[i];
```
Remember that a float has only about 7 useful decimal digits. If w=h=4000, the computed average will be lower than the real one. Use a double, or write two loops: accumulate a local sum in the inner loop (localsum) before adding it to the total in the outer loop (sum += localsum)!
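A sketch of that two-loop summation, reusing in, w, and h from the code above: each row is accumulated into a local sum first, so the running total never dwarfs the values being added to it:

```cpp
float sum = 0.f;
for (int i = 0; i < h; i++) {
    float localsum = 0.f;      // stays small relative to each in[...]
    for (int j = 0; j < w; j++)
        localsum += in[j + i*w];
    sum += localsum;           // only h additions on the big total
}
std::cout << "Mean of input was " << sum / ((float)w * h) << std::endl;
```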
Bye,
Francis

OpenMP - executing threads on chunks

I have the following piece of code, which I want to parallelize in a certain way. I am making a mistake somewhere, and as a result not all threads are running the loop the way I thought they would. It would be great if somebody could help me identify that mistake.
This is code that calculates histograms.
```cpp
#pragma omp parallel default(shared) private(iIndex2, iIndex1, fDist) shared(iSize, dense) reduction(+:iCount)
{
    chunk = (unsigned int)(iSize / omp_get_num_threads());
    threadID = omp_get_thread_num();
    svtout << "Number of threads available " << omp_get_num_threads() << endl;
    svtout << "The threadID is " << threadID << endl;

    // want each of the threads to execute the loop
    for (iIndex1 = 0; iIndex1 < chunk; iIndex1++)
    {
        for (iIndex2 = iIndex1 + 1; iIndex2 < chunk; iIndex2++)
        {
            iCount++;
            fDist = (*this)[iIndex1 + threadID*chunk].distance( (*this)[iIndex2 + threadID*chunk] );
            idx = (int)(fDist / fWidth);
            if ((int)fDist % (int)fWidth >= 0)
            {
                #pragma omp atomic
                dense[idx] += 1;
            }
        }
    }
} // end of parallel region
```
The iCount variable keeps track of the number of iterations, and I noticed that there is a marked difference between the serial and the parallel versions. I guess not all the data is being visited, and hence the histogram values that I obtain from the parallel program are much lower than the actual readings (the dense array stores the histogram values).
Thanks,
Sayan
With more than one thread, you are looping over chunk rather than over iSize.
Try replacing the loop bounds with iSize.
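A minimal sketch of that suggestion, keeping the names from the question (dense, fWidth, iCount, and the distance() call are assumed to exist as in the original): let OpenMP split the full iSize range itself instead of computing chunks by hand:

```cpp
#pragma omp parallel for default(shared) private(iIndex2, fDist, idx) reduction(+:iCount) schedule(dynamic)
for (iIndex1 = 0; iIndex1 < iSize; iIndex1++)
{
    for (iIndex2 = iIndex1 + 1; iIndex2 < iSize; iIndex2++)
    {
        iCount++;
        fDist = (*this)[iIndex1].distance((*this)[iIndex2]);
        idx = (int)(fDist / fWidth);
        if ((int)fDist % (int)fWidth >= 0)
        {
            #pragma omp atomic
            dense[idx] += 1;
        }
    }
}
```

schedule(dynamic) is worth trying here because the inner loop shrinks as iIndex1 grows, so equal-sized static chunks would leave the threads unbalanced.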
