Why is it a Bad Thing for one process to be able to read, or even write, to memory occupied by a different process? [duplicate] - memory-management

It's my understanding that if two threads are reading from the same piece of memory, and no thread is writing to that memory, then the operation is safe. However, I'm not sure what happens if one thread is reading and the other is writing. What would happen? Is the result undefined? Or would the read just be stale? If a stale read is not a concern is it ok to have unsynchronized read-write to a variable? Or is it possible the data would be corrupted, and neither the read nor the write would be correct and one should always synchronize in this case?
I want to say that I've learned it is the later case, that a race on memory access leaves the state undefined... but I don't remember where I may have learned that and I'm having a hard time finding the answer on google. My intuition is that a variable is operated on in registers, and that true (as in hardware) concurrency is impossible (or is it), so that the worst that could happen is stale data, i.e. the following:
WriteThread: copy value from memory to register
WriteThread: update value in register
ReadThread: copy value of memory to register
WriteThread: write new value to memory
At which point the read thread has stale data.

Usually memory is read or written in atomic units determined by the CPU architecture (32 bit and 64 bits item aligned on 32 bit and 64 bit boundaries is common these days).
In this case, what happens depends on the amount of data being written.
Let's consider the case of 32 bit atomic read/write cells.
If two threads write 32 bits into such an aligned cell, then it is absolutely well defined what happens: one of the two written values is retained. Unfortunately for you (well, the program), you don't know which value. By extremely clever programming, you can actually use this atomicity of reads and writes to build synchronization algorithms (e.g., Dekker's algorithm), but it is faster typically to use architecturally defined locks instead.
If two threads write more than an atomic unit (e.g., they both write a 128 bit value), then in fact the atomic unit sized pieces of the values written will be stored in a absolutely well defined way, but you won't know which pieces of which value get written in what order. So what may end up in storage is the value from the first thread, the second thread, or mixes of the bits in atomic unit sizes from both threads.
Similar ideas hold for one thread reading, and one thread writing in atomic units, and larger.
Basically, you don't want to do unsynchronized reads and writes to memory locations, because you won't know the outcome, even though it may be very well defined by the architecture.

The result is undefined. Corrupted data is entirely possible. For an obvious example, consider a 64-bit value being manipulated by a 32-bit processor. Let's assume the value is a simple counter, and we increment it when the lower 32-bits contain 0xffffffff. The increment produces 0x00000000. When we detect that, we increment the upper word. If, however, some other thread read the value between the time the lower word was incremented and the upper word was incremented, they get a value with an un-incremented upper word, but the lower word set to 0 -- a value completely different from what it would have been either before or after the increment is complete.

As I hinted in Ira Baxter's answer, CPU cache also plays a part on multicore systems. Consider the following test code:
DANGER WILL ROBISON!
The following code boosts priority to realtime to achieve somewhat more consistent results - while doing so requires admin privileges, be careful if running the code on dual- or single-core systems, since your machine will lock up for the duration of the test run.
#include <windows.h>
#include <stdio.h>
const int RUNFOR = 5000;
volatile bool terminating = false;
volatile int value;
static DWORD WINAPI CountErrors(LPVOID parm)
{
int errors = 0;
while(!terminating)
{
value = (int) parm;
if(value != (int) parm)
errors++;
}
printf("\tThread %08X: %d errors\n", parm, errors);
return 0;
}
static void RunTest(int affinity1, int affinity2)
{
terminating = false;
DWORD dummy;
HANDLE t1 = CreateThread(0, 0, CountErrors, (void*)0x1000, CREATE_SUSPENDED, &dummy);
HANDLE t2 = CreateThread(0, 0, CountErrors, (void*)0x2000, CREATE_SUSPENDED, &dummy);
SetThreadAffinityMask(t1, affinity1);
SetThreadAffinityMask(t2, affinity2);
ResumeThread(t1);
ResumeThread(t2);
printf("Running test for %d milliseconds with affinity %d and %d\n", RUNFOR, affinity1, affinity2);
Sleep(RUNFOR);
terminating = true;
Sleep(100); // let threads have a chance of picking up the "terminating" flag.
}
int main()
{
SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);
RunTest(1, 2); // core 1 & 2
RunTest(1, 4); // core 1 & 3
RunTest(4, 8); // core 3 & 4
RunTest(1, 8); // core 1 & 4
}
On my Quad-core intel Q6600 system (which iirc has two sets of cores where each set share L2 cache - would explain the results anyway ;)), I get the following results:
Running test for 5000 milliseconds with affinity 1 and 2
Thread 00002000: 351883 errors
Thread 00001000: 343523 errors
Running test for 5000 milliseconds with affinity 1 and 4
Thread 00001000: 48073 errors
Thread 00002000: 59813 errors
Running test for 5000 milliseconds with affinity 4 and 8
Thread 00002000: 337199 errors
Thread 00001000: 335467 errors
Running test for 5000 milliseconds with affinity 1 and 8
Thread 00001000: 55736 errors
Thread 00002000: 72441 errors

Related

Data races with MESI optimization

I dont really understand what exactly is causing the problem in this example:
Here is a snippet from my book:
Based on the discussion of the MESI protocol in the preceding section, it would
seem that the problem of data sharing between L1 caches in a multicore machine
has been solved in a watertight way. How, then, can the memory ordering
bugs we’ve hinted at actually happen?
There’s a one-word answer to that question: Optimization. On most hardware,
the MESI protocol is highly optimized to minimize latency. This means
that some operations aren’t actually performed immediately when messages
are received over the ICB. Instead, they are deferred to save time. As with
compiler optimizations and CPU out-of-order execution optimizations, MESI
optimizations are carefully crafted so as to be undetectable by a single thread.
But, as you might expect, concurrent programs once again get the raw end of
this deal.
For example, our producer (running on Core 1) writes 42 into g_data and
then immediately writes 1 into g_ready. Under certain circumstances, optimizations
in the MESI protocol can cause the new value of g_ready to become
visible to other cores within the cache coherency domain before the updated
value of g_data becomes visible. This can happen, for example, if Core
1 already has g_ready’s cache line in its local L1 cache, but does not have
g_data’s line yet. This means that the consumer (on Core 2) can potentially
see a value of 1 for g_ready before it sees a value of 42 in g_data, resulting in
a data race bug.
Here is the code:
int32_t g_data = 0;
int32_t g_ready = 0;
void ProducerThread() // running on Core 1
{
g_data = 42;
// assume no instruction reordering across this line
g_ready = 1;
}
void ConsumerThread() // running on Core 2
{
while (!g_ready)
PAUSE();
// assume no instruction reordering across this line
ASSERT(g_data == 42);
}
How can g_data be computed but not present in the cache?
This can happen, for example, if Core
1 already has g_ready’s cache line in its local L1 cache, but does not have
g_data’s line yet.
If g_data is not in cache, then why does the previous sentece end with a yet? Would the CPU load the cache line with g_data after it has been computed?
If we read this sentence:
This means that some operations aren’t actually performed immediately when messages are received over the ICB. Instead, they are deferred to save time.
Then what operation is deferred in our example with producer and consumer threads?
So basically I dont understand how under the MESI protocol, some operations are visible to other cores in the wrong order, despite being computed in the right order by a specific core.
PS:
This example is from a book called "Game Engine Architecture, Third Edition" by Jason Gregory, its on the page 309. Here is the book

OpenCL Pseudo random generator race condition

Using this question as basis I implemented a pseudo-random number generator with a global state:
__global uint global_random_state;
void set_random_seed(uint seed){
global_random_state = seed;
}
uint get_random_number(uint range){
uint seed = global_random_state + get_global_id(0);
uint t = seed ^ (seed << 11);
uint result = seed ^ (seed >> 19) ^ (t ^ (t >> 8));
global_random_state = result; /* race condition? */
return result % range;
}
Since these functions will be used from multiple threads, there will be a race condition present when writing to global_random_state.
This might actually help the system to be more unpredictable, so it seems like a good thing, but I'd like to know if there are any consequences to this that might not surface immediately. Are there any side-effects inside the GPU which might cause problems later on when the kernel is run?
In theory you want atom_cmpxchg for correctness here (or find the equivalent GPGPU). However, a grave note of warning, having the entire machine serializing through a single cacheline is going to strangle your performance fundamentally. Atomics on the same address must form a queue and wait. Atomics on different locations can parallelize (more details at the end).
Generally, algorithms that leverage random variables on GPGPU will keep their own copy of the random variable generators. This enables each work item to cache and potentially reuse their own random with out glutting the bus with memory traffic on every new random. Search for "OpenCL Monte Carlo" "Simulation" or "Example" for samples. CUDA has some nice examples too.
Another option is to use a random generator that allows one to skip ahead and have different work items move forward in the sequence different amounts. This can be more compute intensive though, but the tradeoff is that you don't strain the memory hierarchy as much.
More gory details on atomics: (1) GPU cache atomics are designed to expect contiguous arrays and atomic ALUs are per bank, (2) each dword in a cacheline will be processed by the same atomic ALU each time, and (3) neighboring cachelines will hash to different banks. So, if every clock you are doing atomics on contiguous cachelines of data then the work should be perfectly spread out (or statistically so). Conversely, if one makes every work item atomically modify the same 32b, then the cache system cannot apply all the same atomic ALU slot to 16/32/64 (whatever your system uses). It must break the operation up in 16/32/64 separate atomic operations apply it iteratively (by #2 above). In a system where you have 512 ALUs to process atomics you would be using 1 of those ALUs each clock (the same one). Spread the work out and you can use all 512/c.

Do all threads use the same number of registers in CUDA?

if (threadIdx.x < 128) {
float reg[32];
// do something with reg...
} else {
return;
}
let's say each block has 256 threads, but only half of the threads are using registers, and the other half is doing something else (in this case nothing). my question is, how many registers this thread block will use (only concidering reg)? 32 * 256 or 32 * 128 ?
All threads use the same number of registers.
It is a compile-time decision, the decision has nothing to do with runtime behavior, and the compiler determines register usage for all threads in the grid (i.e. kernel launch), it is not in any way decided per thread. At runtime, the necessary number of registers must be allocated for each thread, whether they "use" them or not. See here.
The answer to your question is that regardless of what the threads "do", the number of registers per block is equal to the number of registers per thread (determined at compile time) times the number of threads per block.
So in your example it would presumably be 32*256, not 32*128.

Faster memory allocation and freeing algorithm than multiple Free List method

We allocate and free many memory blocks. We use Memory Heap. However, heap access is costly.
For faster memory access allocation and freeing, we adopt a global Free List. As we make a multithreaded program, the Free List is protected by a Critical Section. However, Critical Section causes a bottleneck in parallelism.
For removing the Critical Section, we assign a Free List for each thread, i.e. Thread Local Storage. However, thread T1 always memory blocks and thread T2 always frees them, so Free List in thread T2 is always increasing, meanwhile there is no benefit of Free List.
Despite of the bottleneck of Critical Section, we adopt the Critical Section again, with some different method. We prepare several Free Lists as well as Critical Sections which is assigned to each Free List, thus 0~N-1 Free Lists and 0~N-1 Critical Sections. We prepare an atomic-operated integer value which mutates to 0, 1, 2, ... N-1 then 0, 1, 2, ... again. For each allocation and freeing, we get the integer value X, then mutate it, access X-th Critical Section, then access X-th Free List. However, this is quite slower than the previous method (using Thread Local Storage). Atomic operation is quite slow as there are more threads.
As mutating the integer value non-atomically cause no corruption, we did the mutation in non-atomic way. However, as the integer value is sometimes stale, there is many chance of accessing the same Critical Section and Free List by different threads. This causes the bottleneck again, though it is quite few than the previous method.
Instead of the integer value, we used thread ID with hashing to the range (0~N-1), then the performance got better.
I guess there must be much better way of doing this, but I cannot find an exact one. Are there any ideas for improving what we have made?
Dealing with heap memory is a task for the OS. Nothing guarantees you can do a better/faster job than the OS does.
But there are some conditions where you can get a bit of improvement, specially when you know something about your memory usage that is unknown to the OS.
I'm writting here my untested idea, hope you'll get some profit of it.
Let's say you have T threads, all of them reserving and freeing memory. The main goal is speed, so I'll try not to use TLS, nor critical blocking, not atomic ops.
If (repeat: if, if, if) the app can fit to several discrete sizes of memory blocks (not random sizes, so as to avoid fragmentation and unuseful holes) then start asking the OS for a number of these discrete blocks.
For example, you have an array of n1 blocks each of size size1, an array of n2 blocks each of size size2, an array of n3... and so on. Each array is bidimensional, the second field just stores a flag for used/free block. If your arrays are very large then it's better to use a dedicated array for the flags (due to contiguous memory usage is always faster).
Now, some one asks for a block of memory of size sB. A specialized function (or object or whatever) searches the array of blocks of size greater or equal to sB, and then selects a block by looking at the used/free flag. Just before ending this task the proper block-flag is set to "used".
When two or more threads ask for blocks of the same size there may be a corruption of the flag. Using TLS will solve this issue, and critical blocking too. I think you can set a bool flag at the beggining of the search into flags-array, that makes the other threads to wait until the flag changes, which only happens after the block-flag changes. With pseudo code:
MemoryGetter(sB)
{
//select which array depending of 'sB'
for (i=0, i < numOfarrays, i++)
if (sizeOfArr(i) >= sB)
arrMatch = i
break //exit for
//wait if other thread wants a block from the same arrMatch array
while ( searching(arrMatch) == true )
; //wait
//blocks other threads wanting a block from the same arrMatch array
searching(arrMatch) = true
//Get the first free block
for (i=0, i < numOfBlocks, i++)
if ( arrOfUsed(arrMatch, i) != true )
selectedBlock = addressOf(....)
//mark the block as used
arrOfUsed(arrMatch, i) = true
break; //exit for
//Allow other threads
searching(arrMatch) = false
return selectedBlock //NOTE: selectedBlock==NULL means no free block
}
Freeing a block is easier, just mark it as free, no thread concurrency issue.
Dealing with no free blocks is up to you (wait, use a bigger block, ask OS for more, etc).
Note that the whole memory is reserved from the OS at app start, which can be a problem.
If this idea makes your app faster, let me know. What I can say for sure is that memory used is greater than if you use normal OS request; but not much if you choose "good" sizes, those most used.
Some improvements can be done:
Cache the last freeded block (per size) so as to avoid the search.
Start with not that much blocks, and ask the OS for more memory only
when needed. Play with 'number of blocks' for each size depending on
your app. Find the optimal case.

What is the difference between mutex and critical section?

Please explain from Linux, Windows perspectives?
I am programming in C#, would these two terms make a difference. Please post as much as you can, with examples and such....
Thanks
For Windows, critical sections are lighter-weight than mutexes.
Mutexes can be shared between processes, but always result in a system call to the kernel which has some overhead.
Critical sections can only be used within one process, but have the advantage that they only switch to kernel mode in the case of contention - Uncontended acquires, which should be the common case, are incredibly fast. In the case of contention, they enter the kernel to wait on some synchronization primitive (like an event or semaphore).
I wrote a quick sample app that compares the time between the two of them. On my system for 1,000,000 uncontended acquires and releases, a mutex takes over one second. A critical section takes ~50 ms for 1,000,000 acquires.
Here's the test code, I ran this and got similar results if mutex is first or second, so we aren't seeing any other effects.
HANDLE mutex = CreateMutex(NULL, FALSE, NULL);
CRITICAL_SECTION critSec;
InitializeCriticalSection(&critSec);
LARGE_INTEGER freq;
QueryPerformanceFrequency(&freq);
LARGE_INTEGER start, end;
// Force code into memory, so we don't see any effects of paging.
EnterCriticalSection(&critSec);
LeaveCriticalSection(&critSec);
QueryPerformanceCounter(&start);
for (int i = 0; i < 1000000; i++)
{
EnterCriticalSection(&critSec);
LeaveCriticalSection(&critSec);
}
QueryPerformanceCounter(&end);
int totalTimeCS = (int)((end.QuadPart - start.QuadPart) * 1000 / freq.QuadPart);
// Force code into memory, so we don't see any effects of paging.
WaitForSingleObject(mutex, INFINITE);
ReleaseMutex(mutex);
QueryPerformanceCounter(&start);
for (int i = 0; i < 1000000; i++)
{
WaitForSingleObject(mutex, INFINITE);
ReleaseMutex(mutex);
}
QueryPerformanceCounter(&end);
int totalTime = (int)((end.QuadPart - start.QuadPart) * 1000 / freq.QuadPart);
printf("Mutex: %d CritSec: %d\n", totalTime, totalTimeCS);
From a theoretical perspective, a critical section is a piece of code that must not be run by multiple threads at once because the code accesses shared resources.
A mutex is an algorithm (and sometimes the name of a data structure) that is used to protect critical sections.
Semaphores and Monitors are common implementations of a mutex.
In practice there are many mutex implementation availiable in windows. They mainly differ as consequence of their implementation by their level of locking, their scopes, their costs, and their performance under different levels of contention. See CLR Inside Out -
Using concurrency for scalability for an chart of the costs of different mutex implementations.
Availiable synchronization primitives.
Monitor
Mutex
Semaphore
ReaderWriterLock
ReaderWriterLockSlim
Interlocked
The lock(object) statement is implemented using a Monitor - see MSDN for reference.
In the last years much research is done on non-blocking synchronization. The goal is to implement algorithms in a lock-free or wait-free way. In such algorithms a process helps other processes to finish their work so that the process can finally finish its work. In consequence a process can finish its work even when other processes, that tried to perform some work, hang. Usinig locks, they would not release their locks and prevent other processes from continuing.
In addition to the other answers, the following details are specific to critical sections on windows:
in the absence of contention, acquiring a critical section is as simple as an InterlockedCompareExchange operation
the critical section structure holds room for a mutex. It is initially unallocated
if there is contention between threads for a critical section, the mutex will be allocated and used. The performance of the critical section will degrade to that of the mutex
if you anticipate high contention, you can allocate the critical section specifying a spin count.
if there is contention on a critical section with a spin count, the thread attempting to acquire the critical section will spin (busy-wait) for that many processor cycles. This can result in better performance than sleeping, as the number of cycles to perform a context switch to another thread can be much higher than the number of cycles taken by the owning thread to release the mutex
if the spin count expires, the mutex will be allocated
when the owning thread releases the critical section, it is required to check if the mutex is allocated, if it is then it will set the mutex to release a waiting thread
In linux, I think that they have a "spin lock" that serves a similar purpose to the critical section with a spin count.
Critical Section and Mutex are not Operating system specific, their concepts of multithreading/multiprocessing.
Critical Section
Is a piece of code that must only run by it self at any given time (for example, there are 5 threads running simultaneously and a function called "critical_section_function" which updates a array... you don't want all 5 threads updating the array at once. So when the program is running critical_section_function(), none of the other threads must run their critical_section_function.
mutex*
Mutex is a way of implementing the critical section code (think of it like a token... the thread must have possession of it to run the critical_section_code)
A mutex is an object that a thread can acquire, preventing other threads from acquiring it. It is advisory, not mandatory; a thread can use the resource the mutex represents without acquiring it.
A critical section is a length of code that is guaranteed by the operating system to not be interupted. In pseudo-code, it would be like:
StartCriticalSection();
DoSomethingImportant();
DoSomeOtherImportantThing();
EndCriticalSection();
The 'fast' Windows equal of critical selection in Linux would be a futex, which stands for fast user space mutex. The difference between a futex and a mutex is that with a futex, the kernel only becomes involved when arbitration is required, so you save the overhead of talking to the kernel each time the atomic counter is modified. That .. can save a significant amount of time negotiating locks in some applications.
A futex can also be shared amongst processes, using the means you would employ to share a mutex.
Unfortunately, futexes can be very tricky to implement (PDF). (2018 update, they aren't nearly as scary as they were in 2009).
Beyond that, its pretty much the same across both platforms. You're making atomic, token driven updates to a shared structure in a manner that (hopefully) does not cause starvation. What remains is simply the method of accomplishing that.
In Windows, a critical section is local to your process. A mutex can be shared/accessed across processes. Basically, critical sections are much cheaper. Can't comment on Linux specifically, but on some systems they're just aliases for the same thing.
Just to add my 2 cents, critical Sections are defined as a structure and operations on them are performed in user-mode context.
ntdll!_RTL_CRITICAL_SECTION
+0x000 DebugInfo : Ptr32 _RTL_CRITICAL_SECTION_DEBUG
+0x004 LockCount : Int4B
+0x008 RecursionCount : Int4B
+0x00c OwningThread : Ptr32 Void
+0x010 LockSemaphore : Ptr32 Void
+0x014 SpinCount : Uint4B
Whereas mutex are kernel objects (ExMutantObjectType) created in the Windows object directory. Mutex operations are mostly implemented in kernel-mode. For instance, when creating a Mutex, you end up calling nt!NtCreateMutant in kernel.
Great answer from Michael. I've added a third test for the mutex class introduced in C++11. The result is somewhat interesting, and still supports his original endorsement of CRITICAL_SECTION objects for single processes.
mutex m;
HANDLE mutex = CreateMutex(NULL, FALSE, NULL);
CRITICAL_SECTION critSec;
InitializeCriticalSection(&critSec);
LARGE_INTEGER freq;
QueryPerformanceFrequency(&freq);
LARGE_INTEGER start, end;
// Force code into memory, so we don't see any effects of paging.
EnterCriticalSection(&critSec);
LeaveCriticalSection(&critSec);
QueryPerformanceCounter(&start);
for (int i = 0; i < 1000000; i++)
{
EnterCriticalSection(&critSec);
LeaveCriticalSection(&critSec);
}
QueryPerformanceCounter(&end);
int totalTimeCS = (int)((end.QuadPart - start.QuadPart) * 1000 / freq.QuadPart);
// Force code into memory, so we don't see any effects of paging.
WaitForSingleObject(mutex, INFINITE);
ReleaseMutex(mutex);
QueryPerformanceCounter(&start);
for (int i = 0; i < 1000000; i++)
{
WaitForSingleObject(mutex, INFINITE);
ReleaseMutex(mutex);
}
QueryPerformanceCounter(&end);
int totalTime = (int)((end.QuadPart - start.QuadPart) * 1000 / freq.QuadPart);
// Force code into memory, so we don't see any effects of paging.
m.lock();
m.unlock();
QueryPerformanceCounter(&start);
for (int i = 0; i < 1000000; i++)
{
m.lock();
m.unlock();
}
QueryPerformanceCounter(&end);
int totalTimeM = (int)((end.QuadPart - start.QuadPart) * 1000 / freq.QuadPart);
printf("C++ Mutex: %d Mutex: %d CritSec: %d\n", totalTimeM, totalTime, totalTimeCS);
My results were 217, 473, and 19 (note that my ratio of times for the last two is roughly comparable to Michael's, but my machine is at least four years younger than his, so you can see evidence of increased speed between 2009 and 2013, when the XPS-8700 came out). The new mutex class is twice as fast as the Windows mutex, but still less than a tenth the speed of the Windows CRITICAL_SECTION object. Note that I only tested the non-recursive mutex. CRITICAL_SECTION objects are recursive (one thread can enter them repeatedly, provided it leaves the same number of times).
I found the explanations stating that critical sections protect a code section from being entered by multiple threads quite misleading.
There is no point in protecting code, as code is read only and can't be modified by multiple threads. What one usually wants is to protect data from being modified by multiple threads, leading to incoherent state. Commonly a mutex (or critical section, fulfilling the same purpose) should be associated with some piece of data. Each code section accessing this data should aquire the mutex/critical section and release when it's finished accessing the data. This may be considerably more fine grained than just locking out threads from entering a function. Also, from my experience, locking functions by some synchronisation is much more prone for errors, in particular dead locks. A good article covering that topic can be found here:
https://www.bogotobogo.com/cplusplus/multithreaded4_cplusplus11B.php
So, in summary (recursive) mutexes and critical sections basically fulfill the same purpose, which is rather not protecting code, but protecting data instead.
Critical sections may be implemented more efficiently than plain kernel mutexes. The example given in the first answer is a bit misleading, because it does not depict what the synchronisation primitive is designed for: synchronize access to sth. from multiple threads. The example just measures the trivial case, when the critical section/mutex is never owned by another thread.
While critical sections could be more efficient if, for instance, two threads access data in short, interlocked periods, they could prove less efficient if we got lots of threads accessing the same piece of data. Each thread would spinlock until giving up and waiting for the semaphore, part of the implementation of the critical section.
Such a case should also be considered when measuring execution times.
A C functions is called reentrant if it uses its actual parameters only.
Reentrant functions can be called by multiple threads at the same time.
Example of reentrant function:
int reentrant_function (int a, int b)
{
int c;
c = a + b;
return c;
}
Example of non reentrant function:
int result;
void non_reentrant_function (int a, int b)
{
int c;
c = a + b;
result = c;
}
The C standard library strtok() is not reentrant and can't be used by 2 or more threads at the same time.
Some platform SDK's comes with the reentrant version of strtok() called strtok_r();

Resources