Cuda, why can't I use more than one streaming processor? - parallel-processing

I implemented an RNS Montgomery exponentiation in CUDA.
Everything works fine, but it runs on just one SM.
So far I have focused on parallelizing just a single exponentiation. What I want to do now is test with several exponentiations in flight; that is, I want the i-th exponentiation to be assigned to a free SM.
I tried, but the total time always grew linearly, meaning all the exponentiations were assigned to the same SM.
Then I switched to streams, but nothing changed.
However, I have never used them before, so maybe I am doing something wrong.
This is the code:
void __smeWrapper() {
    cudaEvent_t start, stop;
    cudaStream_t stream0, stream1, stream2;
    float time;
    unsigned int j, i, tmp;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    dim3 threadsPerBlock(SET_SIZE, (SET_SIZE+1)/2);

    setCudaDevice();
    s_transferDataToGPU();
    if(cudaDeviceSetCacheConfig(cudaFuncCachePreferL1) != cudaSuccess)
        printf("cudaDeviceSetCacheConfig ERROR!");

    cudaEventRecord( start, 0 );
    //for(i=0; i<EXPONENTIATION_NUMBER; i++) {
    i=0;
    __me<<< 1, threadsPerBlock, 0, stream0 >>>(&__s_x[i*(2*SET_SIZE + 1)], __B2modN, __bases, __mmi_NinB, __mmi_Bimodbi, __Bi_inAUar, __dbg, __NinAUar,
                                               __mmi_BinAUar, __mmi_Ajmodaj, __Ajmodar, __mmi_Armodar, __AjinB, __minusAinB, &__z[i*(2*SET_SIZE + 1)], __e);
    i=1;
    __me<<< 1, threadsPerBlock, 0, stream1 >>>(&__s_x[i*(2*SET_SIZE + 1)], __B2modN, __bases, __mmi_NinB, __mmi_Bimodbi, __Bi_inAUar, __dbg, __NinAUar,
                                               __mmi_BinAUar, __mmi_Ajmodaj, __Ajmodar, __mmi_Armodar, __AjinB, __minusAinB, &__z[i*(2*SET_SIZE + 1)], __e);
    i=2;
    __me<<< 1, threadsPerBlock, 0, stream2 >>>(&__s_x[i*(2*SET_SIZE + 1)], __B2modN, __bases, __mmi_NinB, __mmi_Bimodbi, __Bi_inAUar, __dbg, __NinAUar,
                                               __mmi_BinAUar, __mmi_Ajmodaj, __Ajmodar, __mmi_Armodar, __AjinB, __minusAinB, &__z[i*(2*SET_SIZE + 1)], __e);
    //printf("\n%s\n\n", cudaGetErrorString(cudaGetLastError()));
    //}
    cudaEventRecord( stop, 0 );
    cudaEventSynchronize( stop );
    cudaEventElapsedTime( &time, start, stop );
    printf("GPU %f µs : %f ms\n", time*1000, time);
    cudaEventDestroy( start );
    cudaEventDestroy( stop );
}
Ubuntu 11.04 64-bit, CUDA 5.0 RC, GTX 560 Ti (8 SMs)

All threads from a block always run on the same SM. You need to launch more than one block to use the other SMs.
There also seems to be something wrong with your streams: do you call cudaStreamCreate for every stream? On my system the program crashes with a segfault if I don't.
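For reference, a minimal sketch of creating and destroying the streams around the launches from the question (kernel arguments elided; they are the same as above):

cudaStream_t stream0, stream1, stream2;
cudaStreamCreate(&stream0);
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// Kernels launched into different streams may overlap on the device.
// Each launch here is still a single block, which occupies one SM;
// launching grids with more blocks is what spreads work across SMs.
__me<<< 1, threadsPerBlock, 0, stream0 >>>( /* args for i = 0 */ );
__me<<< 1, threadsPerBlock, 0, stream1 >>>( /* args for i = 1 */ );
__me<<< 1, threadsPerBlock, 0, stream2 >>>( /* args for i = 2 */ );

cudaDeviceSynchronize();
cudaStreamDestroy(stream0);
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);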

Related

Why are my MPI-parallelised DO-loops faster the second time round?

I'm working on a scientific model that's parallelised using Open MPI, and I'm finding some weird results with computation time for parallelised loops. Basically, I find that a parallelised loop over a large array is slow right after allocating the (shared) memory, but a second operation is much faster. Why is this?
To demonstrate, here's a dummy program I wrote that illustrates this:
PROGRAM test_program

  USE mpi
  USE, INTRINSIC :: ISO_C_BINDING, ONLY: C_PTR, C_F_POINTER

  IMPLICIT NONE

  INTEGER, PARAMETER :: dp = KIND(1.0D0) ! Double precision
  INTEGER :: process_rank, number_of_processes
  INTEGER :: ierr
  INTEGER(KIND=MPI_ADDRESS_KIND) :: windowsize
  INTEGER :: disp_unit
  TYPE(C_PTR) :: baseptr
  REAL(dp), DIMENSION(:,:), POINTER :: A
  INTEGER :: win
  INTEGER :: nx, ny, i, j, i1, i2, j1, j2
  REAL(dp) :: tstart, tstop, dt1, dt2

  ! Initialise MPI split processes
  CALL MPI_INIT(ierr)
  CALL MPI_COMM_RANK( MPI_COMM_WORLD, process_rank, ierr)
  CALL MPI_COMM_SIZE( MPI_COMM_WORLD, number_of_processes, ierr)

  ! Dimensions of data array
  nx = 20000
  ny = 20000

  ! Allocate MPI-shared memory for data array, with an associated window object
  ! (done by all processes, but only the master actually allocates any space)
  IF (process_rank == 0) THEN
    windowsize = ny*nx*8_MPI_ADDRESS_KIND
    disp_unit = ny*8
  ELSE
    windowsize = 0_MPI_ADDRESS_KIND
    disp_unit = 1
  END IF
  CALL MPI_WIN_ALLOCATE_SHARED( windowsize, disp_unit, MPI_INFO_NULL, MPI_COMM_WORLD, baseptr, win, ierr)

  IF (.NOT. process_rank == 0) THEN
    ! Get the baseptr of the master's memory space.
    CALL MPI_WIN_SHARED_QUERY( win, 0, windowsize, disp_unit, baseptr, ierr)
  END IF

  ! Associate the pointer with this memory space (so all processes act on the same physical memory)
  CALL C_F_POINTER(baseptr, A, [ny, nx])

  ! Divide the memory over the processors; each gets a range i1-i2 of rows
  i1 = MAX(1, FLOOR(REAL(nx * process_rank / number_of_processes)) + 1)
  i2 = MIN(nx, FLOOR(REAL(nx * (process_rank + 1) / number_of_processes)))

  ! Do some operations on the data, measure how long this takes on each core
  tstart = MPI_WTIME()
  DO i = i1, i2
    DO j = 1, ny
      A( j,i) = SQRT( (REAL(i,dp) - REAL(nx,dp)/2._dp)**2 + (REAL(j,dp) - REAL(ny,dp)/2._dp)**2 )
    END DO
  END DO
  tstop = MPI_WTIME()
  dt1 = tstop - tstart
  WRITE(0,*) ' Process ', process_rank, ': dt1 = ', dt1

  ! Do some more calculations on the data, measure how long this takes on each core
  tstart = MPI_WTIME()
  DO i = i1, i2
    DO j = 1, ny
      A( j,i) = SQRT( (REAL(i,dp) - REAL(nx,dp)/4._dp)**2 + (REAL(j,dp) - REAL(ny,dp)/3._dp)**2 )
    END DO
  END DO
  tstop = MPI_WTIME()
  dt2 = tstop - tstart
  WRITE(0,*) ' Process ', process_rank, ': dt2 = ', dt2

  CALL MPI_FINALIZE( ierr)

END PROGRAM test_program
This gives me the following output:
Process 0 : dt1 = 2.1637410982511938
Process 1 : dt1 = 2.6094086961820722
Process 0 : dt2 = 1.0437976177781820
Process 1 : dt2 = 0.96576740033924580
Replacing the first loop by a vectorised operation (e.g. A(:,i1:i2) = 0._dp) gives the same "fast" results for the second loop.
When you allocate memory (such as with MPI_WIN_ALLOCATE_SHARED), the operating system only allocates it virtually: until you touch it, it takes up no space in RAM or on disk.
The first loop writes into each virtually-allocated memory location for the first time, which triggers page faults. When handling a page fault, the operating system maps the required page to a location in physical memory (or terminates the program in case of an illegal memory access). This process is very slow, hence the slow first execution.
During the second loop, no page faults occur because the memory is already physically mapped, hence the faster execution.
Note that this effect is completely independent of MPI (although MPI_WIN_ALLOCATE_SHARED presumably allocates memory differently than usual, since the memory must be shared between processes).
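The same effect is easy to reproduce without MPI. A minimal sketch in C (plain malloc and a POSIX clock; the 800 MB buffer size is an arbitrary choice for illustration):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double seconds(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) * 1e-9;
}

int main(void) {
    size_t n = (size_t)100 * 1000 * 1000;  /* 100M doubles = 800 MB */
    double *a = malloc(n * sizeof *a);     /* virtual allocation only */
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < n; i++) a[i] = 1.0;  /* first touch: page faults */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("first pass : %.3f s\n", seconds(t0, t1));

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < n; i++) a[i] = 2.0;  /* pages already mapped */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("second pass: %.3f s\n", seconds(t0, t1));

    free(a);
    return 0;
}

The first pass should come out noticeably slower than the second, for exactly the page-fault reason described above.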

How to read any frame while having frame number using ffmpeg av_seek_frame()

int64_t timeBase;
timeBase = (int64_t(pavStrm->time_base.num) * AV_TIME_BASE) / int64_t(pavStrm->time_base.den);
int64_t seekTarget = int64_t(iFrameNumber) * timeBase;
av_seek_frame(fmt_ctx, -1, seekTarget, AVSEEK_FLAG_FRAME);
Here I want to read the next 5 frames after iFrameNumber:
for(int iCnt = 0; iCnt <= 4; iCnt++)
{
    iRet = av_read_frame(fmt_ctx, &pkt);
    do
    {
        ret = decode_packet(&got_frame, 0);
        if (ret < 0)
            break;
        pkt.data += ret;
        pkt.size -= ret;
    } while (pkt.size > 0);
    av_free_packet(&pkt);
}
static int decode_packet(int *got_frame, int cached)
{
    int ret = 0;
    int decoded = pkt.size;
    *got_frame = 0;
    if (pkt.stream_index == video_stream_idx)
    {
        /* decode video frame */
        ret = avcodec_decode_video2(video_dec_ctx, frame, got_frame, &pkt);
    }
When I use AVSEEK_FLAG_BACKWARD, it returns 5 packets and 5 frames; the first two are blank, but they are the correct frames.
When I use AVSEEK_FLAG_FRAME, it returns 5 packets and 3 frames, which are not the first 3 frames; it returns the same specific frames from the video for any iFrameNumber.
So please help me: how do I read a frame given its frame number, and what is the exact value for seekTarget, the 3rd parameter of av_seek_frame()?
I also have a problem converting the frame to rgb24 format.
I think av_seek_frame() is one of the most commonly used yet most difficult to understand functions, and it is not well documented either.
If the flag AVSEEK_FLAG_FRAME is set, the third parameter should be the frame number you want to seek to, which you're doing fine.
Let's look at an example to get a better understanding of av_seek_frame().
Say I have a video of 10 frames, with fps=10. The first and fifth frames are key frames (I frames, or intra frames); the others are P frames, or even B frames in some formats.
0    1    2    3    4    5    6    7    8    9    (frame number)
0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  (timestamp, in seconds)
av_seek_frame(fmt_ctx, -1, 2, AVSEEK_FLAG_FRAME);
av_seek_frame(fmt_ctx, -1, 0.15, 0);
// Both of these seek to the fifth frame. Because AVSEEK_FLAG_ANY is not
// given, seeking goes to the next key frame after the third parameter.

av_seek_frame(fmt_ctx, -1, 2, AVSEEK_FLAG_FRAME | AVSEEK_FLAG_ANY);
// This seeks to exactly the frame specified by the third parameter, but
// probably a frame with no meaning on its own. (We can't get a meaningful
// image if the related I/P/B frames are not decoded first.)

av_seek_frame(fmt_ctx, -1, 0.15, AVSEEK_FLAG_ANY);
// Seeks to 0.2. Nothing interesting, as above.

av_seek_frame(fmt_ctx, -1, 0.15, AVSEEK_FLAG_ANY | AVSEEK_FLAG_BACKWARD);
// Seeks to 0.1. Also nothing interesting.

av_seek_frame(fmt_ctx, -1, 2, AVSEEK_FLAG_FRAME | AVSEEK_FLAG_BACKWARD);
// Gets the first frame. Seeking goes to the nearest key frame before the
// third parameter.
So if I'd like to get an arbitrary frame, I usually seek with AVSEEK_FLAG_BACKWARD first and decode as usual, then check the pts and duration of the first several packets to see whether they need to be dropped.
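A rough sketch of that drop loop, reusing the old-style API and variable names from the question's snippets (target_pts, the wanted timestamp in stream time_base units, is assumed to have been computed, e.g. with the FrameToPts helper below):

AVPacket pkt;
int got_frame = 0;

// After av_seek_frame(..., AVSEEK_FLAG_BACKWARD) we are at the key frame
// at or before the target, so decode and discard until we reach it.
while (av_read_frame(fmt_ctx, &pkt) >= 0) {
    if (pkt.stream_index == video_stream_idx) {
        avcodec_decode_video2(video_dec_ctx, frame, &got_frame, &pkt);
        if (got_frame && frame->pkt_pts >= target_pts) {
            av_free_packet(&pkt);
            break;  // this is the frame we were seeking
        }
    }
    av_free_packet(&pkt);
}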
int64_t FrameToPts(AVStream* pavStream, int frame) const
{
    return (int64_t(frame) * pavStream->r_frame_rate.den * pavStream->time_base.den) /
           (int64_t(pavStream->r_frame_rate.num) * pavStream->time_base.num);
}
iSeekTarget = FrameToPts(m_pAVVideoStream, max(0, lFrame));
iSuccess = av_seek_frame(m_pAVFmtCtx, m_iVideo_Stream_idx, iSeekTarget, iSeekFlag);
AVPacket avPacket;
iRet = av_read_frame(m_pAVFmtCtx, &avPacket);

timeBase = (int64_t(video_stream->time_base.num) * AV_TIME_BASE) / int64_t(video_stream->time_base.den);
int64_t seekTarget = int64_t(iFrameNumber) * timeBase * (video_stream->time_base.den / video_stream->avg_frame_rate.num);
int iiiret = av_seek_frame(fmt_ctx, -1, seekTarget, AVSEEK_FLAG_FRAME);

Faster lookup tables using AVX2

I'm trying to speed up an algorithm which performs a series of table lookups. I'd like to use SSE2 or AVX2. I've tried using the _mm256_i32gather_epi32 intrinsic, but it is 31% slower. Does anyone have suggestions for improvements or a different approach?
Timings:
C code = 234
Gathers = 340
static const int32_t g_tables[2][64]; // values between 0 and 63

template <int8_t which, class T>
static void lookup_data(int16_t * dst, T * src)
{
    const int32_t * lut = g_tables[which];

    // Leave this code for Broadwell or Skylake since it's 31% slower than C code
    // (gather is 12 for Haswell, 7 for Broadwell and 5 for Skylake)
#if 0
    if (sizeof(T) == sizeof(int16_t)) {
        __m256i avx0, avx1, avx2, avx3, avx4, avx5, avx6, avx7;
        __m128i sse0, sse1, sse2, sse3, sse4, sse5, sse6, sse7;
        __m256i mask = _mm256_set1_epi32(0xffff);

        avx0 = _mm256_loadu_si256((__m256i *)(lut));
        avx1 = _mm256_loadu_si256((__m256i *)(lut + 8));
        avx2 = _mm256_loadu_si256((__m256i *)(lut + 16));
        avx3 = _mm256_loadu_si256((__m256i *)(lut + 24));
        avx4 = _mm256_loadu_si256((__m256i *)(lut + 32));
        avx5 = _mm256_loadu_si256((__m256i *)(lut + 40));
        avx6 = _mm256_loadu_si256((__m256i *)(lut + 48));
        avx7 = _mm256_loadu_si256((__m256i *)(lut + 56));

        avx0 = _mm256_i32gather_epi32((int32_t *)(src), avx0, 2);
        avx1 = _mm256_i32gather_epi32((int32_t *)(src), avx1, 2);
        avx2 = _mm256_i32gather_epi32((int32_t *)(src), avx2, 2);
        avx3 = _mm256_i32gather_epi32((int32_t *)(src), avx3, 2);
        avx4 = _mm256_i32gather_epi32((int32_t *)(src), avx4, 2);
        avx5 = _mm256_i32gather_epi32((int32_t *)(src), avx5, 2);
        avx6 = _mm256_i32gather_epi32((int32_t *)(src), avx6, 2);
        avx7 = _mm256_i32gather_epi32((int32_t *)(src), avx7, 2);

        avx0 = _mm256_and_si256(avx0, mask);
        avx1 = _mm256_and_si256(avx1, mask);
        avx2 = _mm256_and_si256(avx2, mask);
        avx3 = _mm256_and_si256(avx3, mask);
        avx4 = _mm256_and_si256(avx4, mask);
        avx5 = _mm256_and_si256(avx5, mask);
        avx6 = _mm256_and_si256(avx6, mask);
        avx7 = _mm256_and_si256(avx7, mask);

        sse0 = _mm_packus_epi32(_mm256_castsi256_si128(avx0), _mm256_extracti128_si256(avx0, 1));
        sse1 = _mm_packus_epi32(_mm256_castsi256_si128(avx1), _mm256_extracti128_si256(avx1, 1));
        sse2 = _mm_packus_epi32(_mm256_castsi256_si128(avx2), _mm256_extracti128_si256(avx2, 1));
        sse3 = _mm_packus_epi32(_mm256_castsi256_si128(avx3), _mm256_extracti128_si256(avx3, 1));
        sse4 = _mm_packus_epi32(_mm256_castsi256_si128(avx4), _mm256_extracti128_si256(avx4, 1));
        sse5 = _mm_packus_epi32(_mm256_castsi256_si128(avx5), _mm256_extracti128_si256(avx5, 1));
        sse6 = _mm_packus_epi32(_mm256_castsi256_si128(avx6), _mm256_extracti128_si256(avx6, 1));
        sse7 = _mm_packus_epi32(_mm256_castsi256_si128(avx7), _mm256_extracti128_si256(avx7, 1));

        _mm_storeu_si128((__m128i *)(dst), sse0);
        _mm_storeu_si128((__m128i *)(dst + 8), sse1);
        _mm_storeu_si128((__m128i *)(dst + 16), sse2);
        _mm_storeu_si128((__m128i *)(dst + 24), sse3);
        _mm_storeu_si128((__m128i *)(dst + 32), sse4);
        _mm_storeu_si128((__m128i *)(dst + 40), sse5);
        _mm_storeu_si128((__m128i *)(dst + 48), sse6);
        _mm_storeu_si128((__m128i *)(dst + 56), sse7);
    }
    else
#endif
    {
        for (int32_t i = 0; i < 64; i += 4)
        {
            *dst++ = src[*lut++];
            *dst++ = src[*lut++];
            *dst++ = src[*lut++];
            *dst++ = src[*lut++];
        }
    }
}
You're right that gather is slower than a PINSRD loop on Haswell. It's probably nearly break-even on Broadwell. (See also the x86 tag wiki for perf links, especially Agner Fog's insn tables, microarch pdf, and optimization guide)
If your indices are small, or you can slice them up, pshufb can be used as a parallel LUT with 4-bit indices. It gives you sixteen 8-bit table entries, but you can use stuff like punpcklbw to combine two vectors of byte results into one vector of 16-bit results (separate tables for the high and low halves of the LUT entries, with the same 4-bit indices).
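A minimal sketch of that two-table trick (SSSE3 level; the table contents below are arbitrary placeholders, and the indices are assumed to already be bytes in the 0..15 range):

#include <tmmintrin.h>
#include <stdint.h>

// Look up sixteen 16-bit entries by 4-bit index: one pshufb fetches the
// low bytes, another the high bytes, then interleave the two results.
static inline void lut16_epi16(const uint8_t idx[16], int16_t out[16])
{
    const __m128i lo = _mm_setr_epi8(   // low bytes of entries 0..15
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
    const __m128i hi = _mm_setr_epi8(   // high bytes of entries 0..15
        0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3);
    __m128i ix = _mm_loadu_si128((const __m128i *)idx);
    __m128i lb = _mm_shuffle_epi8(lo, ix);  // low halves of the results
    __m128i hb = _mm_shuffle_epi8(hi, ix);  // high halves of the results
    _mm_storeu_si128((__m128i *)out,       _mm_unpacklo_epi8(lb, hb));
    _mm_storeu_si128((__m128i *)(out + 8), _mm_unpackhi_epi8(lb, hb));
}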
This kind of technique gets used for Galois Field multiplies, when you want to multiply every element of a big buffer of GF16 values by the same value (e.g. for Reed-Solomon error correction codes). As I said, using it requires exploiting special properties of your use-case.
AVX2 can do two 128b pshufbs in parallel, in each lane of a 256b vector. There is nothing better until AVX512F: __m512i _mm512_permutex2var_epi32 (__m512i a, __m512i idx, __m512i b). There are byte (vpermi2b in AVX512VBMI), word (vpermi2w in AVX512BW), dword (this one, vpermi2d in AVX512F), and qword (vpermi2q in AVX512F) element size versions. This is a full cross-lane shuffle, indexing into two concatenated source registers. (Like AMD XOP's vpperm).
The two different instructions behind the one intrinsic (vpermt2d / vpermi2d) give you a choice of overwriting the table with the result, or overwriting the index vector. The compiler will pick based on which inputs are reused.
Your specific case:
*dst++ = src[*lut++];
The lookup-table is actually src, not the variable you've called lut. lut is actually walking through an array which is used as a shuffle-control mask for src.
You should make g_tables an array of uint8_t for best performance. The entries are only 0..63, so they fit. Zero-extending loads into full registers are as cheap as normal loads, so this just reduces the cache footprint. To use it with AVX2 gathers, use vpmovzxbd. The intrinsic is frustratingly difficult to use as a load, because there's no form that takes an int64_t *, only __m256i _mm256_cvtepu8_epi32 (__m128i a), which takes a __m128i. This is one of the major design flaws with intrinsics, IMO.
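A hedged sketch of that load (load_u8_indices is a hypothetical helper name):

#include <immintrin.h>
#include <stdint.h>

// Load eight uint8_t table entries and zero-extend them to the eight
// dword lanes that vpgatherdd wants as indices. Compilers usually fold
// the movq load and vpmovzxbd into one memory-source instruction.
static inline __m256i load_u8_indices(const uint8_t *p)
{
    __m128i bytes = _mm_loadl_epi64((const __m128i *)p);  // 8 bytes
    return _mm256_cvtepu8_epi32(bytes);                   // 8 x int32
}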
I don't have any great ideas for speeding up your loop. Scalar code is probably the way to go here. The SIMD code shuffles 64 int16_t values into a new destination, I guess. It took me a while to figure that out, because I didn't find the if (sizeof...) line right away, and there are no comments. :( It would be easier to read if you used sane variable names, not avx0... Using x86 gather instructions for elements smaller than 4B certainly requires annoying masking. However, instead of pack, you could use a shift and OR.
You could make an AVX512 version for sizeof(T) == sizeof(int8_t) or sizeof(T) == sizeof(int16_t), because all of src will fit into one or two zmm registers.
If g_tables was being used as a LUT, AVX512 could do it easily, with vpermi2b. You'd have a hard time without AVX512, though, because a 64-byte table is too big for pshufb. Using four lanes (16B) of pshufb for each input lane could work: mask off indices outside 0..15, then indices outside 16..31, etc., with pcmpgtb or something. Then you have to OR all four lanes together. So this sucks a lot.
Possible speedups: design the shuffle by hand
If you're willing to design a shuffle by hand for a specific value of g_tables, there are potential speedups that way. Load a vector from src, shuffle it with a compile-time constant pshufb or pshufd, then store any contiguous blocks in one go. (Maybe with pextrd or pextrq, or even better movq from the bottom of the vector. Or even a full-vector movdqu).
Actually, loading multiple src vectors and shuffling between them is possible with shufps. It works fine on integer data, with no slowdowns except on Nehalem (and maybe also on Core2). punpcklwd / dq / qdq (and the corresponding punpckhwd etc) can interleave elements of vectors, and give different choices for data movement than shufps.
If it doesn't take too many instructions to construct a few full 16B vectors, you're in good shape.
If g_tables can take on too many possible values, it might be possible to JIT-compile a custom shuffle function. This is probably really hard to do well, though.

DirectX 11 Compute Shader device synchronization?

Background: performing benchmarking/comparisons across GPGPU platforms.
Problem: device synchronization when dispatching a DirectX 11 compute shader.
I'm looking for the equivalent of cudaDeviceSynchronize() or clFinish(...) to make a fair comparison of how my algorithm performs.
CUDA and OpenCL functions are clearer about blocking/non-blocking behaviour. DirectCompute, however, is more tied to the graphics pipeline (which I am learning and am very unfamiliar with), and therefore I have trouble finding out whether a Dispatch call is blocking, or whether previously issued memory allocations/transfers have finished.
Code DX_1:
// Setup
...
for (...) {
    startTimer();
    context->Dispatch(number_of_groups, 1, 1);
    times[i] = stopTimer();
}
// Release
...
Code DX_2:
for (...) {
    // Setup
    ...
    startTimer();
    context->Dispatch(number_of_groups, 1, 1);
    times[i] = stopTimer();
    // Release
    ...
}
Results (average times of 2^2 to 2^11 elements):
 DX_1     DX_2    CUDA
  1.6    205.5    24.8
  1.8    133.4    24.8
 29.1    186.5    25.6
 18.6    175.0    25.6
 11.4    187.5    26.6
 85.2    127.7    26.3
166.4    151.1    28.1
 98.2    149.5    35.2
 26.8    203.5    31.6
Notice: these times are run on a desktop GPU with a screen connected, some erratic timings are expected. Times are not supposed to include host to device buffer transfers.
Notice 2: These are very short sequences (4 - 2048 elements) the interesting tests are performed on problem sizes of up to 2^26 elements.
My new solution is to avoid synchronization with the device. I have looked into some methods of retrieving timestamps instead; the results look OK and I'm fairly sure the comparisons are fair enough. I compared my CUDA times (event record vs. QPC) and the difference is small, a seemingly constant overhead.
CUDA Event   Host QPC
   4.6         30.0
   4.8         30.0
   5.0         31.0
   5.2         32.0
   5.6         34.0
   6.1         34.0
   6.9         31.0
   8.3         47.0
   9.2         34.0
  12.0         39.0
  16.7         46.0
  20.5         55.0
  32.1         69.0
  48.5        111.0
  86.0        134.0
 182.4        237.0
 419.0        473.0
In case my question brings someone here in hopes of finding out how to do GPGPU benchmarking, I will leave some code behind demonstrating my current benchmarking strategy.
Code Examples, CUDA
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
float milliseconds = 0;
cudaEventRecord(start);
...
// Launch my algorithm
...
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&milliseconds, start, stop);
OpenCL
cl_event start_event, end_event;
cl_ulong start = 0, end = 0;
// Enqueue a dummy kernel for the start event.
clEnqueueNDRangeKernel(..., &start_event);
...
// Launch my algorithm
...
// Enqueue a dummy kernel for the end event.
clEnqueueNDRangeKernel(..., &end_event);
clWaitForEvents(1, &end_event);
clGetEventProfilingInfo(start_event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(end_event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);
timeInMS = (double)(end - start)*(double)(1e-06);
DirectCompute
Here I followed the suggestion from Adam Miles and looked into that source. It will look something like this:
ID3D11Device* device = nullptr;
...
// Setup
...
ID3D11QueryPtr disjoint_query;
ID3D11QueryPtr q_start;
ID3D11QueryPtr q_end;
...
if (disjoint_query == NULL)
{
    D3D11_QUERY_DESC desc;
    desc.Query = D3D11_QUERY_TIMESTAMP_DISJOINT;
    desc.MiscFlags = 0;
    device->CreateQuery(&desc, &disjoint_query);
    desc.Query = D3D11_QUERY_TIMESTAMP;
    device->CreateQuery(&desc, &q_start);
    device->CreateQuery(&desc, &q_end);
}
context->Begin(disjoint_query);
context->End(q_start);
...
// Launch my algorithm
...
context->End(q_end);
context->End(disjoint_query);
UINT64 start, end;
D3D11_QUERY_DATA_TIMESTAMP_DISJOINT q_freq;
while (S_OK != context->GetData(q_start, &start, sizeof(UINT64), 0)){};
while (S_OK != context->GetData(q_end, &end, sizeof(UINT64), 0)){};
while (S_OK != context->GetData(disjoint_query, &q_freq, sizeof(D3D11_QUERY_DATA_TIMESTAMP_DISJOINT), 0)){};
timeInMS = (((double)(end - start)) / ((double)q_freq.Frequency)) * 1000.0;
C/C++/OpenMP
static LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds, Frequency;

static void __inline startTimer()
{
    QueryPerformanceFrequency(&Frequency);
    QueryPerformanceCounter(&StartingTime);
}

static double __inline stopTimer()
{
    QueryPerformanceCounter(&EndingTime);
    ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
    ElapsedMicroseconds.QuadPart *= 1000000;
    ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;
    return (double)ElapsedMicroseconds.QuadPart;
}
My code examples are taken out of context, and though I tried to do some clean-up, errors might be present.
If you're interested in how long a particular Draw or Dispatch is taking on the GPU then you should take a look at DirectX 11's Timestamp queries. You can query the GPU's clock frequency and current clock value before and after some GPU work and figure out how long that took in wall time.
This is probably a good primer / example on how to do it:
https://mynameismjp.wordpress.com/2011/10/13/profiling-in-dx11-with-queries/
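If all you need is a blocking wait (the cudaDeviceSynchronize() analogue the question asks about), you can also spin on a D3D11 event query. A minimal sketch, assuming the device and context objects from the snippets above:

D3D11_QUERY_DESC desc;
desc.Query = D3D11_QUERY_EVENT;   // signaled when preceding GPU work completes
desc.MiscFlags = 0;
ID3D11Query* sync = nullptr;
device->CreateQuery(&desc, &sync);

context->Dispatch(number_of_groups, 1, 1);
context->End(sync);

// GetData returns S_FALSE until everything issued before End() has
// finished on the GPU, so this loop is a device-wide wait.
while (context->GetData(sync, NULL, 0, 0) == S_FALSE) { /* spin */ }
sync->Release();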

How to divide a 300*200 Lattice Boltzmann cube across 16 cores - MPI

I am new to parallel programming using MPI. I need to parallelize a 300x200 Lattice Boltzmann cube. I managed to chunk it row-wise by dividing 200 into chunks depending on size. However, my code works only when there are 4 or 8 cores; I need to run the program on 16 cores. Can anyone please direct me as to how to divide 200 across 16 cores?
I am splitting currently in the following manner:
.
.
.
MPI_Init( &argc, &argv );
/* size and rank will become ubiquitous */
/* get the number of processes (size) & the rank of each process */
MPI_Comm_size( MPI_COMM_WORLD, &size );
MPI_Comm_rank( MPI_COMM_WORLD, &rank );
start = (200 / size) * rank;
end   = start + (200 / size);
.
.
.
for(ii=start;ii<end;ii++) {
for(jj=0;jj<300;jj++)
.
.
.
}
.
.
Clearly, the above technique works only if 200 % size == 0; for 16 cores, size = 16, and hence the approach fails. Could anyone please suggest a more generalized chunking approach, one that would (if possible) make the program independent of the number of cores it runs on?
The easiest way to fix that would be to calculate 'start' and 'end' as
slice_size = (200 + size - 1) / size;  // max. number of rows per core
start = rank * slice_size;
end = min(start + slice_size, 200);
In this case, some cores can be underloaded.
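A slightly more balanced variant is to spread the remainder of 200 % size over the first ranks, so no core gets more than one extra row (a sketch):

int rows = 200;
int base = rows / size;   /* every rank gets at least this many rows */
int rem  = rows % size;   /* the first `rem` ranks get one extra row */
int start = rank * base + (rank < rem ? rank : rem);
int end   = start + base + (rank < rem ? 1 : 0);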
Better scalability could be achieved if the lattice is divided not by rows only, but by both rows and columns, or even into non-rectangular areas, for example using a linear representation of the lattice, like this:

total_cells = rows * columns;
common_piece_size = (total_cells + size - 1) / size;
start = rank * common_piece_size;
end = min(start + common_piece_size, total_cells);
for (i = start; i < end; i++) {
    row = i / columns;
    col = i % columns;
    // process cell [col, row]
}

This would require more complex inter-process communication, though.
