Almost lockless producer consumer - thread-safety

I have a producer-consumer problem to solve, with a slight modification: there are many parallel producers but only one consumer running in a single thread. When a producer finds no free slot in the buffer, it simply drops the element (without waiting for the consumer). I have written some C pseudocode:
struct Element
{
    ULONG content;
    volatile LONG bNew;
};

ULONG max_count = 10;
Element *buffer = calloc(max_count, sizeof(Element));
volatile LONG producer_idx = 0;
LONG consumer_idx = 0;
EVENT NotEmpty;

BOOLEAN produce(ULONG content)
{
    LONG idx = InterlockedIncrement(&producer_idx) % max_count;
    if (buffer[idx].bNew)
        return FALSE; /* slot not consumed yet: drop the element */
    buffer[idx].content = content;
    buffer[idx].bNew = TRUE;
    SetEvent(NotEmpty);
    return TRUE;
}

void consume_thread()
{
    while (TRUE)
    {
        Wait(NotEmpty);
        while (buffer[consumer_idx].bNew)
        {
            ULONG content = buffer[consumer_idx].content;
            InterlockedExchange(&buffer[consumer_idx].bNew, FALSE);
            /* Simple mechanism for preventing producer_idx overflow */
            LONG tmp = producer_idx;
            InterlockedCompareExchange(&producer_idx, tmp % max_count, tmp);
            consumer_idx = (consumer_idx + 1) % max_count;
            doSth(content);
        }
    }
}
I am not 100% sure that this code is correct. Can you see any problems that could occur? Or maybe this code could be written in a better way?

Don't use global variables to accomplish your goal, especially in a multithreaded application! Use a semaphore instead, and don't do a blocking Lock but a TryLock: if the TryLock fails, it means there's no room for another element, so you can skip it.
Here is something to read about semaphores in WinAPI, since you will probably use them:
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686946(v=vs.85).aspx
http://msdn.microsoft.com/en-us/library/windows/desktop/ms687032(v=vs.85).aspx
You can achieve the TryLock functionality by passing 0 as the timeout to the WaitForSingleObject function.
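For illustration, a minimal sketch of that approach (the names are illustrative and the buffer-writing part is elided): a semaphore counts the free slots, and a zero-timeout WaitForSingleObject acts as the TryLock.

HANDLE hFreeSlots = CreateSemaphore(NULL, max_count, max_count, NULL);

BOOLEAN try_produce(ULONG content)
{
    /* TryLock: returns immediately instead of blocking. */
    if (WaitForSingleObject(hFreeSlots, 0) != WAIT_OBJECT_0)
        return FALSE; /* no free slot: skip the element */
    /* ... store the element into the buffer here ... */
    SetEvent(NotEmpty); /* wake the consumer */
    return TRUE;
}

/* The consumer frees a slot after it has read an element:
   ReleaseSemaphore(hFreeSlots, 1, NULL); */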

Please, read this: http://en.wikipedia.org/wiki/Memory_barrier
The C and C++ standards do not address multiple threads (or multiple
processors), and as such, the usefulness of volatile depends on the
compiler and hardware. Although volatile guarantees that the volatile
reads and volatile writes will happen in the exact order specified in
the source code, the compiler may generate code (or the CPU may
re-order execution) such that a volatile read or write is reordered
with regard to non-volatile reads or writes, thus limiting its
usefulness as an inter-thread flag or mutex. Moreover, it is not
guaranteed that volatile reads and writes will be seen in the same
order by other processors due to caching, cache coherence protocol and
relaxed memory ordering, meaning volatile variables alone may not even
work as inter-thread flags or mutexes.
So in the general case, volatile alone won't work for C.
But it can work for some specific compilers/hardware, and in other languages (Java 5, for example).
See also Is function call a memory barrier?
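For the question's code specifically, one way to hedge against this on Windows is to publish the flag with an Interlocked operation, which acts as a full memory barrier. A sketch of the producer (this addresses only visibility/ordering, not the index-wraparound races):

BOOLEAN produce(ULONG content)
{
    LONG idx = InterlockedIncrement(&producer_idx) % max_count;
    if (buffer[idx].bNew)
        return FALSE; /* slot not consumed yet: drop the element */
    buffer[idx].content = content;
    /* Full barrier: a consumer that sees bNew == TRUE also sees content. */
    InterlockedExchange(&buffer[idx].bNew, TRUE);
    SetEvent(NotEmpty);
    return TRUE;
}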

Trap memory accesses inside a standard executable built with MinGW

My problem sounds like this:
I have some platform-dependent code (for an embedded system) which writes to some MMIO locations that are hardcoded at specific addresses.
I compile this code together with some management code into a standard executable, mainly for testing, but also for simulation (because it takes longer to find basic bugs on the actual HW platform).
To get rid of the hardcoded pointers, I just redefine them to some variables inside the memory pool. And this works really well.
The problem is that there is specific hardware behavior on some of the MMIO locations (w1c, for example) which makes "correct" testing hard to impossible.
These are the solutions i thought of:
1 - Somehow redefine the accesses to those registers and insert some immediate function to simulate the dynamic behavior. This is not really usable, since there are various ways to write to the MMIO locations (pointers and such).
2 - Leave the addresses hardcoded and trap the illegal access through a seg fault: find the location that triggered it, extract exactly where the access was made, handle it, and return. I am not really sure how this would work (or even whether it's possible).
3 - Use some sort of emulation. This will surely work, but it would defeat the whole purpose of running fast and natively on a standard computer.
4 - Virtualization? This would probably take a lot of time to implement, and I'm not sure the gain is justifiable.
Does anyone have any idea whether this can be accomplished without going too deep? Maybe there is a way to manipulate the compiler to define a memory area for which every access generates a callback. I'm not really an expert in x86/gcc matters.
Edit: It seems that it's not really possible to do this in a platform-independent way, and since it will be Windows-only, I will use the available API (which seems to work as expected). I found this Q here:
Is set single step trap available on win 7?
I will put the whole "simulated" register file inside a number of pages, guard them, and trigger a callback from which I will extract all the necessary info, do my stuff, then continue execution.
Thanks all for responding.
I think #2 is the best approach. I routinely use approach #4, but I use it to test code that is running in the kernel, so I need a layer below the kernel to trap and emulate the accesses. Since you have already put your code into a user-mode application, #2 should be simpler.
The answers to this question may provide help in implementing #2. How to write a signal handler to catch SIGSEGV?
What you really want to do, though, is to emulate the memory access and then have the segv handler return to the instruction after the access. This sample code works on Linux; I'm not sure whether the behavior it takes advantage of is undefined, though.
#include <stdint.h>
#include <stdio.h>
#include <signal.h>
#include <ucontext.h> // for ucontext_t, REG_RAX, REG_RIP

#define REG_ADDR ((volatile uint32_t *)0x12340000f000ULL)

static uint32_t read_reg(volatile uint32_t *reg_addr)
{
    uint32_t r;
    asm("mov (%1), %0" : "=a"(r) : "r"(reg_addr));
    return r;
}

static void segv_handler(int, siginfo_t *, void *);

int main()
{
    struct sigaction action = {};
    action.sa_sigaction = segv_handler;
    action.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &action, NULL);

    // force sigsegv
    uint32_t a = read_reg(REG_ADDR);

    printf("after segv, a = %u\n", a);
    return 0;
}

static void segv_handler(int, siginfo_t *info, void *ucontext_arg)
{
    ucontext_t *ucontext = static_cast<ucontext_t *>(ucontext_arg);
    ucontext->uc_mcontext.gregs[REG_RAX] = 1234; // emulate the read: result into eax
    ucontext->uc_mcontext.gregs[REG_RIP] += 2;   // skip the faulting mov (2 bytes here)
}
The code that reads the register is written in assembly to ensure that both the destination register and the length of the instruction are known.
This is what the Windows version of prl's answer could look like:
#include <stdint.h>
#include <stdio.h>
#include <windows.h>

#define REG_ADDR ((volatile uint32_t *)0x12340000f000ULL)

static uint32_t read_reg(volatile uint32_t *reg_addr)
{
    uint32_t r;
    asm("mov (%1), %0" : "=a"(r) : "r"(reg_addr));
    return r;
}

static LONG WINAPI segv_handler(EXCEPTION_POINTERS *);

int main()
{
    SetUnhandledExceptionFilter(segv_handler);

    // force sigsegv
    uint32_t a = read_reg(REG_ADDR);

    printf("after segv, a = %u\n", a);
    return 0;
}

static LONG WINAPI segv_handler(EXCEPTION_POINTERS *ep)
{
    // only handle read access violations at REG_ADDR
    if (ep->ExceptionRecord->ExceptionCode != EXCEPTION_ACCESS_VIOLATION ||
        ep->ExceptionRecord->ExceptionInformation[0] != 0 ||
        ep->ExceptionRecord->ExceptionInformation[1] != (ULONG_PTR)REG_ADDR)
        return EXCEPTION_CONTINUE_SEARCH;

    ep->ContextRecord->Rax = 1234; // emulate the read
    ep->ContextRecord->Rip += 2;   // skip the faulting mov (2 bytes here)
    return EXCEPTION_CONTINUE_EXECUTION;
}
So, the solution (code snippet) is as follows:
First of all, I have a variable:
__attribute__ ((aligned (4096))) int g_test;
Second, inside my main function, I do the following:
AddVectoredExceptionHandler(1, VectoredHandler);
DWORD old;
VirtualProtect(&g_test, 4096, PAGE_READWRITE | PAGE_GUARD, &old);
The handler looks like this:
#define PAGE_MASK 0xFFF /* assuming 4 KiB pages */

LONG WINAPI VectoredHandler(struct _EXCEPTION_POINTERS *ExceptionInfo)
{
    static ULONG_PTR last_addr; /* wide enough for x64 addresses */

    if (ExceptionInfo->ExceptionRecord->ExceptionCode == STATUS_GUARD_PAGE_VIOLATION) {
        last_addr = ExceptionInfo->ExceptionRecord->ExceptionInformation[1];
        ExceptionInfo->ContextRecord->EFlags |= 0x100; /* Single step to trigger the next one */
        return EXCEPTION_CONTINUE_EXECUTION;
    }

    if (ExceptionInfo->ExceptionRecord->ExceptionCode == STATUS_SINGLE_STEP) {
        DWORD old;
        VirtualProtect((PVOID)(last_addr & ~(ULONG_PTR)PAGE_MASK), 4096, PAGE_READWRITE | PAGE_GUARD, &old);
        return EXCEPTION_CONTINUE_EXECUTION;
    }

    return EXCEPTION_CONTINUE_SEARCH;
}
This is only a basic skeleton for the functionality. Basically, I guard the page on which the variable resides, and I keep some linked lists holding pointers to the callback functions and values for the addresses in question. I check that the fault-generating address is inside my list, and then I trigger the callback.
On the first guard hit, the page protection is disabled by the system, but I can call my PRE_WRITE callback, where I can save the variable's state. Because a single step is requested through EFlags, it will be followed immediately by a single-step exception (which means the variable has just been written), and I can trigger a WRITE callback. All the data required for the operation is contained in the ExceptionInformation array.
When someone tries to write to that variable:
*(int *)&g_test = 1;
a PRE_WRITE followed by a WRITE will be triggered, and when I do:
int x = *(int *)&g_test;
a READ will be issued.
This way I can manipulate the data flow in a manner that does not require modifications of the original source code.
Note: This is intended to be used as part of a test framework and any penalty hit is deemed acceptable.
For example, a W1C (write-1-to-clear) operation can be implemented like this:
void MYREG_hook(reg_cbk_t type)
{
    /* We need to save the pre-write state.
     * This is safe since we are assured to be called with
     * both PRE_WRITE and WRITE, in the correct order.
     */
    static int pre;

    switch (type) {
    case REG_READ:      /* Called pre-read */
        break;
    case REG_PRE_WRITE: /* Called pre-write */
        pre = g_test;
        break;
    case REG_WRITE:     /* Called after write */
        g_test = pre & ~g_test; /* W1C */
        break;
    default:
        break;
    }
}
This was also possible with seg-faults on illegal addresses, but I had to issue one for each read/write and keep track of a "virtual register file", so the penalty was bigger. This way I can guard only specific areas of memory, or none, depending on the registered monitors.

update integer array elements atomically C++

Given a shared array of integer counters, I am interested to know whether a thread can atomically fetch-and-add an array element without locking the entire array.
Here's an illustration of a working model that uses a mutex to lock access to the entire array.
// thread-shared class members
std::mutex count_array_mutex_;
std::vector<int> counter_array_( 100ish );

// Thread critical section
int counter_index = ... // unpredictable index
int current_count;
{
    std::lock_guard<std::mutex> lock(count_array_mutex_);
    current_count = counter_array_[counter_index]++;
}
// ... do stuff using current_count.
I'd like multiple threads to be able to fetch-and-add separate array elements simultaneously.
So far, in my research of std::atomic<int>, I'm thrown off by the fact that constructing the atomic object also constructs the protected member. (And there are plenty of answers explaining why you can't make a std::vector<std::atomic<int>>.)
C++20 / C++2a (or whatever you want to call it) will add std::atomic_ref<T> which lets you do atomic operations on an object that wasn't atomic<T> to start with.
It's not available yet as part of the standard library for most compilers, but there is a working implementation for gcc/clang/ICC and other compilers with GNU extensions.
Previously, atomic access to "plain" data was only available with platform-specific functions like Microsoft's LONG InterlockedExchange(LONG volatile *Target, LONG Value); or GNU C/C++'s
type __atomic_add_fetch (type *ptr, type val, int memorder) (the same builtins that C++ libraries for GNU compilers use to implement std::atomic<T>).
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0019r8.html includes some intro stuff about the motivation. CPUs can easily do this, compilers can already do this, and it's been annoying that C++ didn't expose this capability portably.
So instead of having to wrestle with C++ to get all the non-atomic allocation and init done in a constructor, you can just have every access create an atomic_ref to the element you want to access. (It's free to instantiate as a local, at least when it's lock-free, on any "normal" C++ implementations).
This will even let you do things like resize the std::vector<int> after you've ensured no other threads are accessing the vector elements or the vector control block itself. And then you can signal the other threads to resume.
It's not yet implemented in libstdc++ or libc++ for gcc/clang.
#include <vector>
#include <atomic>

#define Foo std // this atomic_ref.hpp puts it in namespace Foo, not std.
// current raw url for https://github.com/ORNL/cpp-proposals-pub/blob/master/P0019/atomic_ref.hpp
#include "https://raw.githubusercontent.com/ORNL/cpp-proposals-pub/580934e3b8cf886e09accedbb25e8be2d83304ae/P0019/atomic_ref.hpp"

void inc_element(std::vector<int> &v, size_t idx)
{
    v[idx]++;
}

void atomic_inc_element(std::vector<int> &v, size_t idx)
{
    std::atomic_ref<int> elem(v[idx]);
    static_assert(decltype(elem)::is_always_lock_free,
                  "performance is going to suck without lock-free atomic_ref<T>");
    elem.fetch_add(1, std::memory_order_relaxed); // take your pick of memory order here
}
For x86-64, these compile exactly the way we'd hope with GCC,
using the sample implementation (for compilers implementing GNU extensions) linked in the C++ working-group proposal: https://github.com/ORNL/cpp-proposals-pub/blob/master/P0019/atomic_ref.hpp
From the Godbolt compiler explorer with g++ 8.2 -Wall -O3 -std=gnu++2a:
inc_element(std::vector<int, std::allocator<int> >&, unsigned long):
mov rax, QWORD PTR [rdi] # load the pointer member of std::vector
add DWORD PTR [rax+rsi*4], 1 # and index it as a memory destination
ret
atomic_inc_element(std::vector<int, std::allocator<int> >&, unsigned long):
mov rax, QWORD PTR [rdi]
lock add DWORD PTR [rax+rsi*4], 1 # same but atomic RMW
ret
The atomic version is identical except it uses a lock prefix to make the read-modify-write atomic, by making sure no other core can read or write the cache line while this core is in the middle of atomically modifying it. Just in case you were curious how atomics work in asm.
Most non-x86 ISAs, like AArch64, of course require an LL/SC retry loop to implement an atomic RMW, even with relaxed memory order.
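For illustration, a relaxed fetch_add(1) on AArch64 (without ARMv8.1 LSE atomics) compiles to roughly this kind of retry loop; the register choices here are illustrative:

.Lretry:
    ldxr    w0, [x1]        // load-exclusive the old value
    add     w2, w0, 1       // old + 1
    stxr    w3, w2, [x1]    // store-exclusive; w3 == 0 on success
    cbnz    w3, .Lretry     // another core intervened: retry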
The point here is that constructing / destructing the atomic_ref doesn't cost anything. Its member pointer fully optimizes away. So this is exactly as cheap as a vector<atomic<int>>, but without the headache.
As long as you're careful not to create data-race UB by resizing the vector, or accessing an element without going through atomic_ref. (It would potentially manifest as a use-after-free on many real implementations if std::vector reallocated the memory in parallel with another thread indexing into it, and of course you'd be atomically modifying a stale copy.)
This definitely gives you rope to hang yourself if you don't carefully respect the fact that the std::vector object itself is not atomic, and also that the compiler won't stop you from doing non-atomic access to the underlying v[idx] after other threads have started using it.
One way:
// Create.
std::vector<std::atomic<int>> v(100);
// Initialize.
for(auto& e : v)
e.store(0, std::memory_order_relaxed);
// Atomically increment.
auto unpredictable_index = std::rand() % v.size();
int old = v[unpredictable_index].fetch_add(1, std::memory_order_relaxed);
Note that the std::atomic<> copy constructor is deleted, so the vector cannot be resized and needs to be initialized with the final count of elements.
Since the resize functionality of std::vector is lost anyway, instead of std::vector you may as well use std::unique_ptr<std::atomic<int>[]>, e.g.:
// Create.
unsigned const N = 100;
std::unique_ptr<std::atomic<int>[]> p(new std::atomic<int>[N]);
// Initialize.
for(unsigned i = 0; i < N; ++i)
p[i].store(0, std::memory_order_relaxed);
// Atomically increment.
auto unpredictable_index = std::rand() % N;
int old = p[unpredictable_index].fetch_add(1, std::memory_order_relaxed);

Why doesn't boost::lockfree::spsc_queue have emplace?

The regular std::vector has emplace_back, which avoids an unnecessary copy. Is there a reason spsc_queue doesn't support this? Is it impossible to do emplace with lock-free queues for some reason?
I'm not a Boost library implementer nor maintainer, so the rationale behind not including an emplace member function is beyond my knowledge, but it isn't too difficult to implement it yourself if you really need it.
The spsc_queue has a base class of either compile_time_sized_ringbuffer or runtime_sized_ringbuffer, depending on whether the size of the queue is known at compile time or not. These two classes maintain the actual buffer used (with the obvious differences between a dynamic buffer and a compile-time buffer) but delegate, in this case, their push member functions to a common base class - ringbuffer_base.
The ringbuffer_base::push function is relatively easy to grok:
bool push(T const & t, T * buffer, size_t max_size)
{
    const size_t write_index = write_index_.load(memory_order_relaxed); // only written from push thread
    const size_t next = next_index(write_index, max_size);

    if (next == read_index_.load(memory_order_acquire))
        return false; /* ringbuffer is full */

    new (buffer + write_index) T(t); // copy-construct
    write_index_.store(next, memory_order_release);
    return true;
}
The index of the location where the next item should be stored is obtained with a relaxed load (which is safe, since the intended use of this class is a single producer making the push calls); the code then computes the appropriate next index and checks to make sure everything is in bounds (with a load-acquire for appropriate synchronization with the thread that calls pop). But the main statement we're interested in is:
new (buffer + write_index) T(t); // copy-construct
This performs a placement-new copy construction into the buffer. There's nothing inherently thread-unsafe about passing around some parameters with which to construct a T directly from viable constructor arguments. I wrote the following snippet, and made the necessary changes throughout the derived classes to appropriately delegate the work up to the base class:
template<typename... Args>
std::enable_if_t<std::is_constructible<T, Args...>::value, bool>
emplace(T * buffer, size_t max_size, Args&&... args)
{
    const size_t write_index = write_index_.load(memory_order_relaxed); // only written from push thread
    const size_t next = next_index(write_index, max_size);

    if (next == read_index_.load(memory_order_acquire))
        return false; /* ringbuffer is full */

    new (buffer + write_index) T(std::forward<Args>(args)...); // emplace
    write_index_.store(next, memory_order_release);
    return true;
}
Perhaps the only differences are making sure that the arguments passed in Args... can actually be used to construct a T, and of course doing the emplacement via std::forward instead of a copy construction.
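Usage could then look like the following sketch. It assumes a patched spsc_queue whose derived classes forward to the emplace above; none of this exists in stock Boost:

#include <boost/lockfree/spsc_queue.hpp>
#include <string>

struct Event {
    Event(int id, std::string msg) : id(id), msg(std::move(msg)) {}
    int id;
    std::string msg;
};

boost::lockfree::spsc_queue<Event, boost::lockfree::capacity<128>> q;

void producer()
{
    // Constructs the Event in place inside the ring buffer:
    // no temporary Event is created and then copied.
    q.emplace(42, "connected");
}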

asio implicit strand and data synchronization

When I read the asio source code, I was curious how asio keeps data synchronized between threads even when an implicit strand is formed. This is the code in asio:
io_service::run
mutex::scoped_lock lock(mutex_);

std::size_t n = 0;
for (; do_run_one(lock, this_thread, ec); lock.lock())
    if (n != (std::numeric_limits<std::size_t>::max)())
        ++n;
return n;
io_service::do_run_one
while (!stopped_)
{
    if (!op_queue_.empty())
    {
        // Prepare to execute first handler from queue.
        operation* o = op_queue_.front();
        op_queue_.pop();
        bool more_handlers = (!op_queue_.empty());

        if (o == &task_operation_)
        {
            task_interrupted_ = more_handlers;

            if (more_handlers && !one_thread_)
            {
                if (!wake_one_idle_thread_and_unlock(lock))
                    lock.unlock();
            }
            else
                lock.unlock();

            task_cleanup on_exit = { this, &lock, &this_thread };
            (void)on_exit;

            // Run the task. May throw an exception. Only block if the operation
            // queue is empty and we're not polling, otherwise we want to return
            // as soon as possible.
            task_->run(!more_handlers, this_thread.private_op_queue);
        }
        else
        {
            std::size_t task_result = o->task_result_;

            if (more_handlers && !one_thread_)
                wake_one_thread_and_unlock(lock);
            else
                lock.unlock();

            // Ensure the count of outstanding work is decremented on block exit.
            work_cleanup on_exit = { this, &lock, &this_thread };
            (void)on_exit;

            // Complete the operation. May throw an exception. Deletes the object.
            o->complete(*this, ec, task_result);

            return 1;
        }
    }
}
In do_run_one, the mutex is always unlocked before the handler executes. If there is an implicit strand, handlers will not execute concurrently, but the problem is: thread A runs a handler which modifies data, and thread B runs the next handler, which reads the data that thread A modified. Without the protection of a mutex, how does thread B see the changes that thread A made to the data? Unlocking the mutex ahead of the handler execution doesn't establish a happens-before relationship between the threads that access the data the handlers touch.
Going further, the handler execution uses a thing called fenced_block:
completion_handler* h(static_cast<completion_handler*>(base));
ptr p = { boost::addressof(h->handler_), h, h };
BOOST_ASIO_HANDLER_COMPLETION((h));
// Make a copy of the handler so that the memory can be deallocated before
// the upcall is made. Even if we're not about to make an upcall, a
// sub-object of the handler may be the true owner of the memory associated
// with the handler. Consequently, a local copy of the handler is required
// to ensure that any owning sub-object remains valid until after we have
// deallocated the memory here.
Handler handler(BOOST_ASIO_MOVE_CAST(Handler)(h->handler_));
p.h = boost::addressof(handler);
p.reset();
// Make the upcall if required.
if (owner)
{
    fenced_block b(fenced_block::half);
    BOOST_ASIO_HANDLER_INVOCATION_BEGIN(());
    boost_asio_handler_invoke_helpers::invoke(handler, handler);
    BOOST_ASIO_HANDLER_INVOCATION_END;
}
What is this? I know a fence is a synchronization primitive supported by C++11, but this fence is written entirely by asio itself. Does this fenced_block help do the job of data synchronization?
UPDATED
After I googled and read this and this, asio indeed uses memory fence primitives to synchronize data between threads, which is faster than holding the lock until the handler execution completes (there is a speed difference on x86). In fact, Java's volatile keyword is implemented by inserting a memory barrier after a write of, and before a read of, such a variable, to establish the happens-before relationship.
If someone could simply describe asio's memory fence implementation, or add something I missed or misunderstood, I will accept their answer.
Before the operation invokes the user handler, Boost.Asio uses a memory fence to provide the appropriate memory reordering without forcing mutual exclusion of handler execution. Thus, thread B will observe changes to memory that occurred within the context of thread A.
C++03 did not specify requirements for memory visibility with regard to multi-threaded execution. However, C++11 defines these requirements in § 1.10 Multi-threaded executions and data races, as well as in the Atomic operations and Thread support library sections. Boost and C++11 mutexes do perform the appropriate memory reordering. For other implementations, it is worth checking the mutex library's documentation to verify that the memory reordering occurs.
Boost.Asio memory fences are an implementation detail, and thus always subject to change. Boost.Asio abstracts itself from the architecture/compiler-specific implementations through a series of conditional defines within asio/detail/fenced_block.hpp, where only a single memory-barrier implementation is included. The underlying implementation is contained within a class for which a fenced_block alias is created via a typedef.
Here is a relevant excerpt:
#elif defined(__GNUC__) && (defined(__hppa) || defined(__hppa__))
# include "asio/detail/gcc_hppa_fenced_block.hpp"
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
# include "asio/detail/gcc_x86_fenced_block.hpp"
#elif ...
...
namespace asio {
namespace detail {
...
#elif defined(__GNUC__) && (defined(__hppa) || defined(__hppa__))
typedef gcc_hppa_fenced_block fenced_block;
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
typedef gcc_x86_fenced_block fenced_block;
#elif ...
...
} // namespace detail
} // namespace asio
The implementations of the memory barriers are specific to the architecture and compiler. Boost.Asio has a family of asio/detail/*_fenced_block.hpp header files. For example, win_fenced_block uses InterlockedExchange for Borland, and otherwise uses the xchg assembly instruction, which has an implicit lock prefix when used with a memory address. For gcc_x86_fenced_block, Boost.Asio uses an inline assembly statement with a "memory" clobber.
If you find yourself needing to use a fence, then consider the Boost.Atomic library. Introduced in Boost 1.53, Boost.Atomic provides an implementation of thread and signal fences based on the C++11 standard. Boost.Asio had been using its own implementation of memory fences before Boost.Atomic was added to Boost. Also, Boost.Asio's fences are scope-based: fenced_block performs an acquire in its constructor and a release in its destructor.
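To make the scoped idea concrete, here is a minimal sketch using C++11 fences (this is not Boost.Asio's actual code, which is selected per architecture/compiler as shown above):

#include <atomic>

class scoped_fenced_block {
public:
    // Acquire on entry: observe writes published by the thread that
    // ran the previous handler.
    scoped_fenced_block()  { std::atomic_thread_fence(std::memory_order_acquire); }
    // Release on exit: publish this handler's writes to whichever
    // thread runs the next handler.
    ~scoped_fenced_block() { std::atomic_thread_fence(std::memory_order_release); }
};

void invoke_handler_example(void (*handler)())
{
    scoped_fenced_block fence;
    handler(); // the user handler runs between the two fences
}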

What useful things can I do with Visual C++ Debug CRT allocation hooks except finding reproduceable memory leaks?

The Visual C++ debug runtime library features so-called allocation hooks. They work this way: you define a callback and call _CrtSetAllocHook() to set that callback. Now, every time a memory allocation/deallocation/reallocation is done, the CRT calls that callback and passes a handful of parameters.
I successfully used an allocation hook to find a reproducible memory leak - basically the CRT reported that there was an unfreed block with allocation number N (N was the same on every program run) at program termination, so I wrote the following in my hook:
int MyAllocHook(int allocType, void* userData, size_t size, int blockType,
                long requestNumber, const unsigned char* filename, int lineNumber)
{
    if (requestNumber == TheNumberReported) {
        Sleep(0); // a line to put a breakpoint on
    }
    return TRUE;
}
Since the leak was reported with the very same allocation number every time, I could just put a breakpoint inside the if statement, wait until it was hit, and then inspect the call stack.
What other useful things can I do using allocation hooks?
You could also use it to find unreproducible memory leaks (a sketch follows the list):
- Make a data structure where you map the allocated pointer to additional information
- In the allocation hook, query the current call stack (the StackWalk function) and store the call stack in the data structure
- In the deallocation hook, remove the call stack information for that allocation
- At the end of your application, loop over the data structure and report all remaining call stacks: these are the places where memory was allocated but not freed
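Here is a minimal sketch of that idea. CaptureStackBackTrace is a real Win32 API; RequestNumberFromHeader is a hypothetical helper that would use the heap-header trick from the next answer, and the reentrancy guard is deliberately simplistic (a real version would use thread-local storage):

#include <windows.h>
#include <crtdbg.h>
#include <array>
#include <map>

typedef std::array<void*, 32> Stack;
static std::map<long, Stack> g_liveAllocs; // requestNumber -> call stack
static bool g_inHook = false;              // guard: the map itself allocates

long RequestNumberFromHeader(void* userData); // hypothetical, see the next answer

int TrackingAllocHook(int allocType, void* userData, size_t size, int blockType,
                      long requestNumber, const unsigned char* filename, int lineNumber)
{
    if (g_inHook || blockType == _CRT_BLOCK) // skip CRT-internal blocks
        return TRUE;
    g_inHook = true;
    if (allocType == _HOOK_ALLOC) {
        Stack s = {};
        CaptureStackBackTrace(1, (DWORD)s.size(), s.data(), NULL);
        g_liveAllocs[requestNumber] = s;
    } else if (allocType == _HOOK_FREE) {
        // requestNumber is not supplied on free; recover it from the
        // debug heap header as shown in the next answer.
        g_liveAllocs.erase(RequestNumberFromHeader(userData));
    }
    g_inHook = false;
    return TRUE;
}
// Install early in main(): _CrtSetAllocHook(TrackingAllocHook);
// Whatever is left in g_liveAllocs at exit leaked; symbolize the stacks with DbgHelp.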
The value "requestNumber" is not passed on to the function when deallocating (MS VS 2008). Without this number you cannot keep track of your allocation. However, you can peek into the heap header and extract that value from there:
Note: This is compiler dependent and may change without notice/ warning by the compiler.
// This struct is a copy of the heap header used by MS VS 2008.
// This information prepends each allocated memory object in debug mode.
struct MsVS_CrtMemBlockHeader {
    MsVS_CrtMemBlockHeader * _next;
    MsVS_CrtMemBlockHeader * _prev;
    char * _szFilename;
    int    _nLine;
    int    _nDataSize;
    int    _nBlockUse;
    long   _lRequest;
    char   _gap[4];
};
int MyAllocHook(int allocType, void* userData, size_t size, int blockType,
                long requestNumber, const unsigned char* filename, int lineNumber)
{
    if (allocType == _HOOK_FREE) {
        // requestNumber isn't passed to the hook on free.
        // However, the value is stored in the heap header.
        size_t headerSize = sizeof(MsVS_CrtMemBlockHeader);
        MsVS_CrtMemBlockHeader* pHead;
        size_t ptr = (size_t)userData - headerSize;
        pHead = (MsVS_CrtMemBlockHeader*)(ptr);
        long freedRequestNumber = pHead->_lRequest;
        // Do what you like to keep track of this allocation.
    }
    return TRUE;
}
You could keep a record of every allocation request and then remove it once the corresponding deallocation is invoked, for instance. This could help you track down memory-leak problems that are much harder to pin down than the reproducible kind.
Just the first idea that comes to my mind...
