How to get the thread priority in the Windows kernel? - windows

I am writing a kernel-mode driver and I would like to get the priority of a /user-mode/ thread (it should be a number between 0 and 15).
I have the PETHREAD.

KeQueryPriorityThread returns the current priority of the thread. It does not correspond to the priority given to a thread via SetThreadPriority. AFAIK, it is the combined priority, i.e. process priority + thread priority + dynamic priorities.

Related

How can I get the maximum number of OpenMP threads that may be created during the whole execution of the program?

I want to create one global array of objects (One object per possible thread spawned by OpenMP) and reuse it throughout the program. Each thread will read its number using omp_get_thread_num and use it to index into the array.
How can I get the maximum number of OpenMP threads that may be created during the whole execution of the program?
The documentation of omp_get_max_threads says that the function returns a value specific to the particular parallel region where it was invoked:
omp_get_max_threads – Maximum number of threads of parallel region
Description: Return the maximum number of threads used for the current parallel region that does not use the clause num_threads.
In contrast, the wording of the MSDN documentation implies that the value returned by omp_get_max_threads outside a parallel region is the same value that would be returned at any other point:
omp_get_max_threads
Returns an integer that is equal to or greater than the number of threads that would be available if a parallel region without num_threads were defined at that point in the code.
Which one of them is correct?
There is no maximum number.
Technically OpenMP defines an internal control variable called nthreads-var (see OpenMP 4.5 2.3.3) that is sort of the default number of threads. You read it with omp_get_max_threads, you set it with omp_set_num_threads (an unfortunate naming glitch), and you override it with an explicit num_threads clause.
So you will have to write your code such that it can cope with an unexpected number of threads, e.g. by preallocating the array up to omp_get_max_threads() and lazily resizing it if more threads arrive. Or take a reasonable guess and check the index bounds on each access.
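The second suggestion (take a guess, check the bounds on each access, grow lazily) can be sketched independently of OpenMP. The following Python stand-in is only an illustration under stated assumptions: `PerThreadSlots`, `initial_guess`, `factory`, and the plain integer `thread_num` (playing the role of the value returned by omp_get_thread_num) are hypothetical names, not part of any OpenMP API.

```python
import threading


class PerThreadSlots:
    """One slot per thread index, grown lazily when an index is out of range."""

    def __init__(self, initial_guess, factory):
        self._factory = factory                      # builds one per-thread object
        self._slots = [factory() for _ in range(initial_guess)]
        self._lock = threading.Lock()                # guards resizing only

    def get(self, thread_num):
        # Bounds check on every access, as the answer suggests.
        if thread_num >= len(self._slots):
            with self._lock:
                # Re-check under the lock: another thread may have grown the list.
                while thread_num >= len(self._slots):
                    self._slots.append(self._factory())
        return self._slots[thread_num]

    def __len__(self):
        return len(self._slots)


slots = PerThreadSlots(initial_guess=4, factory=list)
slots.get(2).append("work from thread 2")   # within the initial guess
slots.get(9).append("work from thread 9")   # beyond it: the array grows to 10 slots
```

In an actual OpenMP program the resize would need comparable protection, since several threads may exceed the guess at the same time.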

Why Garbage Collect in web apps?

Consider building a web app on a platform where every request is handled by a User Level Thread(ULT) (green thread/erlang process/goroutine/... any light weight thread). Assuming every request is stateless and resources like DB connection are obtained at startup of the app and shared between these threads. What is the need for garbage collection in these threads?
Generally such a thread is short running(a few milliseconds) and if well designed doesn't use more than a few (KB or MB) of memory. If garbage collection of the resources allocated in the thread is done at the exit of the thread and independent of the other threads, then there would be no GC pauses for even the 98th or 99th percentile of requests. All requests would be answered in predictable time.
What is the problem with such a model and why is it not being widely used?
Your assumption might not be true:
if well designed doesn't use more than a few (KB or MB) of memory
Imagine a function for counting the words in a text file, used in a web app. A naive implementation could be:
def count_words(text):
    words = text.split()
    count = {}
    for w in words:
        if w in count:
            count[w] += 1
        else:
            count[w] = 1
    return count
It allocates more memory than the text itself occupies, because the word list and the dictionary exist alongside the original string.
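One way to check this claim is to measure the peak allocation with the standard-library tracemalloc module. This is a rough sketch, assuming CPython; the sample text is made up for illustration, and the exact numbers will vary by interpreter version.

```python
import tracemalloc


def count_words(text):
    words = text.split()          # a list plus one new str object per word
    count = {}
    for w in words:
        if w in count:
            count[w] += 1
        else:
            count[w] = 1
    return count


# Hypothetical input: 460,000 characters of ASCII text.
text = "alpha beta gamma delta " * 20_000

tracemalloc.start()               # only allocations made after this point are traced
result = count_words(text)
_, peak = tracemalloc.get_traced_memory()   # (current, peak) in bytes
tracemalloc.stop()

print(f"text length: {len(text)} characters, peak allocation inside the call: {peak} bytes")
```

The text is created before tracing starts, so the reported peak covers only what count_words allocates on top of it.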

How are CUDA threads executed inside a single block?

I have several questions regarding CUDA. The following figure is taken from a book on parallel programming. It shows how threads are allocated on the device for the multiplication of two vectors, each of length 8192.
1) in threadblock 0 there are 15 SIMD threads. Are these 15 threads executed in parallel or just one thread at a specific time?
2) each block contains 512 elements in this example. is this number dependent on the hardware or is it a decision of the programmer?
1)
In this particular example, each thread seems to be assigned to 32 elements in the vector. Code that is executed by a single thread is executed sequentially.
2)
The size of the thread blocks is up to the programmer. However, there are restrictions on the number and size of the thread blocks given the hardware the code is executed on. For more information on this, see this elaborate answer:
Understanding CUDA grid dimensions, block dimensions and threads organization (simple explanation)
From your illustration, it seems that:
The grid is composed of 16 thread blocks, numbered from 0 to 15.
Each block is composed of 16 "SIMD threads", numbered from 0 to 15
Each "SIMD thread" computes the product of 32 vector elements.
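The arithmetic behind this reading can be checked with a few lines of Python. This is only a sketch: it assumes elements are assigned contiguously (block first, then "SIMD thread", then position within the 32 elements), which the figure may or may not show, and `locate` and `lane` are names made up for illustration.

```python
VECTOR_LENGTH = 8192
BLOCKS = 16               # thread blocks 0..15
SIMD_THREADS = 16         # "SIMD threads" 0..15 per block
ELEMENTS_PER_SIMD = 32    # elements handled by each "SIMD thread"

# 16 blocks * 16 SIMD threads * 32 elements covers the whole vector.
assert BLOCKS * SIMD_THREADS * ELEMENTS_PER_SIMD == VECTOR_LENGTH


def locate(i):
    """Return (block, simd_thread, lane) for vector element i."""
    # 16 * 32 = 512 elements per block
    block, rest = divmod(i, SIMD_THREADS * ELEMENTS_PER_SIMD)
    simd_thread, lane = divmod(rest, ELEMENTS_PER_SIMD)
    return block, simd_thread, lane


print(locate(0))      # (0, 0, 0)
print(locate(511))    # (0, 15, 31)
print(locate(512))    # (1, 0, 0)
print(locate(8191))   # (15, 15, 31)
```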
It is not necessarily obvious from the illustration whether "SIMD thread" means, in the CUDA (OpenCL) parlance:
A warp (wavefront) of 32 threads (work-items)
or:
A thread (work-item) working on 32 elements
I will assume the former ("SIMD thread" = warp/wavefront), since it is a more reasonable assumption performance-wise, but the latter isn't technically incorrect, it's simply suboptimal design (on current hardware, at least).
1) in threadblock 0 there are 15 SIMD threads. Are these 15 threads executed in parallel or just one thread at a specific time?
As stated above, there are 16 warps (numbered from 0 to 15, that makes 16) in thread block 0, each of them made of 32 threads. The threads within a warp execute in lockstep, simultaneously, in parallel. The warps are executed independently of one another, sequentially or in parallel, depending on the capabilities of the underlying hardware. For example, the hardware may be capable of scheduling a number of warps for simultaneous execution.
2) each block contains 512 elements in this example. is this number dependent on the hardware or is it a decision of the programmer?
In this case, it is simply a decision of the programmer, but in some cases there are also hardware limitations that could force the programmer into changing the design. For example, there is a maximum number of threads a block can handle, and there is a maximum number of blocks a grid can handle.

Setting IO priority to Low instead of Very Low

I am trying to set the IO priority of a process to Low. Currently I am calling the SetPriorityClass function with PROCESS_MODE_BACKGROUND_BEGIN. This sets the IO priority to "VeryLow". How do I go about setting it one level higher, to "Low"?
http://msdn.microsoft.com/en-us/library/ms686219(VS.85).aspx

fast thread ordering algorithm without atomic CAS

I am looking for an approach that will let me assign ordinal numbers 0..(N-1) to N O/S threads, such that the threads are in numeric order. That is, the thread that gets ordinal 0 will have a lower O/S thread ID than the thread with ordinal 1.
In order to carry this out, the threads communicate via a shared memory space.
The memory ordering model is such that writes will be atomic (if two concurrent threads write a memory location at the same time, the result will be one or the other). The platform will not support atomic compare-and-set operations.
I am looking for an algorithm that is efficient in the number of writes to shared memory, will complete rapidly with up to tens of thousands of threads, and has no nasty worst-case thread arrival conditions.
The O/S will assign thread numbers in arbitrary order throughout a 32-bit space. There may be arbitrary thread creation delays - the algorithm can be considered complete when all N threads are present.
I am unable to use the obvious solution of collecting all the threads, and then having one thread sort them - without an atomic operation, I have no way of safely collecting all the individual threads (another thread could rewrite the slot).
With no claim to being optimal in any sense (there are clearly faster ways to do this with atomic compare-and-set operations, or as Martin indicated, atomic increment)...
Assuming N is known to all the threads, and each thread has a unique non-zero ID value, such as its stack address in 32-bit space...
Use an array of size N in shared space; ensure that this array is initialized to zero.
Each thread owns the first slot in the array that holds an ID lower than or equal to the thread's ID; the thread writes its ID there. This continues until the array is full of non-zero values, and all the values are in decreasing order.
At the completion of the algorithm, the index of the thread's slot in the array is its ordinal number.
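One possible reading of these steps, as a single-threaded Python simulation. This is a sketch only: it assumes a displaced thread simply rescans and claims again, it runs the "threads" round-robin rather than concurrently, and so it does not exercise real races. The names `first_claimable` and `assign_ordinals` are made up for illustration.

```python
def first_claimable(slots, my_id):
    """Index of the first slot holding a value <= my_id, or None."""
    for i, value in enumerate(slots):
        if value <= my_id:
            return i
    return None


def assign_ordinals(thread_ids):
    """Round-robin, single-threaded simulation of the slot-claiming scheme."""
    slots = [0] * len(thread_ids)          # shared array, initialized to zero
    changed = True
    while changed:                         # repeat until no thread writes any more
        changed = False
        for my_id in thread_ids:           # each "thread" takes one step per round
            i = first_claimable(slots, my_id)
            if i is not None and slots[i] != my_id:
                slots[i] = my_id           # plain write, no compare-and-set
                changed = True
    # A thread's ordinal is the index of the slot that ends up holding its ID.
    return {tid: i for i, tid in enumerate(slots)}


ordinals = assign_ordinals([5, 9, 3, 12])
print(ordinals)   # {12: 0, 9: 1, 5: 2, 3: 3}
```

As the answer states, the array ends up in decreasing order, so in this sketch ordinal 0 goes to the largest ID.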
If I have this right, you want to map Integer->Integer, where the input is an arbitrary 32-bit number and the output is a number from 0-N, where N is the number of threads?
In that case, call this method every time you create a new thread; the returned value is the ID:
integer nextId = 0;
integer GetInteger()
{
    return AtomicIncrement(nextId);
}
This algorithm is obviously O(N)
Assuming several things:
threads never die
You have some kind of atomic increment which increments a number and returns either the old value or the new value; the only difference is whether the IDs are 0-based or 1-based
Threads call a method asking what their ID is just once when they're created
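For illustration, the same idea in runnable Python. Python has no built-in atomic increment, so this sketch assumes a threading.Lock as a stand-in for the AtomicIncrement primitive in the pseudocode; `get_integer` and `worker` are hypothetical names.

```python
import threading

_next_id = 0
_next_id_lock = threading.Lock()   # stands in for a hardware atomic increment


def get_integer():
    """Return the next ordinal (0-based); safe to call from many threads."""
    global _next_id
    with _next_id_lock:
        assigned = _next_id
        _next_id += 1
    return assigned


ids = []
ids_lock = threading.Lock()


def worker():
    my_id = get_integer()          # called once, when the thread starts
    with ids_lock:
        ids.append(my_id)


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(ids))   # [0, 1, 2, 3, 4, 5, 6, 7]
```

Note that the ordinals handed out this way follow arrival order, not the numeric order of the O/S thread IDs.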
