Handling interrupt in character device - linux-kernel

I am trying to correctly register interrupt in kernel for user interface.
Surprisingly, I did not find many examples in kernel for that.
irq handler
static irqreturn_t irq_handler(int irq, void *dev_id)
{
struct el_irq_dev *el_irq = &el_irq_devices[0];
printk("irq in\n");
spin_lock(&el->my_lock,flags);
clear_interrupt()
some_buffer[buf_wr] = ch;
el_irq->buf_wr++;
if (el_irqbuf_wr >= 16)
el_irqbuf_wr = 0;
spin_unlock(&el->my_lock,flags);
wake_up_interruptible(&el->pollw);
return IRQ_HANDLED;
}
ioctl for waiting on interrupts
static long el_device_ioctl( struct file *filp,
unsigned int ioctl_num,
unsigned long ioctl_param)
{
struct el_irq_dev *el_irq = &el_irq_devices[0];
switch (ioctl_num) {
case IOCTL_WAIT_IRQ: <<<---- using ioctl (no poll) to wait on interrupt
wait_event_interruptible(el_irq->pollw, &el_irq->buf_wr != &el_irq->buf_rd) ;
spin_lock(&el_irq->my_lock);
if (el_irq->buf_wr != &el_irq->buf_rd)
{
my_value=some_buffer[el_irq->buf_rd];
el_irq->buf_rd++;
if (el_irq->buf_rd >= 16)
el_irq->buf_rd = 0;
}
spin_unlock(&el_irq->my_lock);
copy_to_user(ioctl_param,&my_value,sizeof(my_value));
default:
break;
}
return 0;
}
My question is:
Should we put the clear of interrupts (clear_interrupt() ) in fpga in the interrupt
before or after the wake_up ? Can we event put the clearing interrupt in the userspace handler (IOCTL_WAIT_IRQ) instead of clearing the interrupt in the
interrupt handler ?
As you can see in the code, I am using cyclic buffer in order to handle cases where the userspace handler is missing interrupts. Is that really required or can we assume that there are no misses ?
In other words, Is it reasnoble to assume that there should never be missed interrupts ? so that the ioctl call should never see more than 1 waiting interrupt ? If yes - maybe I don't need buffer mechanism between the interrupt handler and the ioctl handler.
Thank you,
Ran

Short answer.
It seems reasonable to me to clear interrupts in the user-space handler. It makes some sense to do it as late as possible, after all work has been done, as long as you check again after clearing that there is really no work left to do (some more work might have arrived just before clearing).
The user-space handler might indeed miss interrupts, e.g. if several arrive between calls to IOCTL_WAIT_IRQ. Interrupts might also get "missed" in some sense though if several pieces of work arrive before interrupts are cleared. The stack (hardware and software) should be designed so that this is not a problem. An interrupt should just signal that there is work to be done, and the user-space handler should be able to just do all outstanding work before returning.
You should probably be using spin_lock_irqsave() in your IOCtl code[1].
[1] http://www.makelinux.net/ldd3/chp-5-sect-5

Related

commenting out a printk statement causes crash in a linux device driver test

I'm seeing a weird case in a simple linux driver test(arm64).
The user program calls ioctl of a device driver and passes array 'arg' of uint64_t as argument. By the way, arg[2] contains a pointer to a variable in the app. Below is the code snippet.
case SetRunParameters:
copy_from_user(args, (void __user *)arg, 8*3);
offs = args[2] % PAGE_SIZE;
down_read(&current->mm->mmap_sem);
res = get_user_pages( (unsigned long)args[2], 1, 1, &pages, NULL);
if (res) {
kv_page_addr = kmap(pages);
kv_addr = ((unsigned long long int)(kv_page_addr)+offs);
args[2] = page_to_phys(pages) + offset; // args[2] changed to physical
}
else {
printk("get_user_pages failed!\n");
}
up_read(&current->mm->mmap_sem);
*(vaddr + REG_IOCTL_ARG/4) = virt_to_phys(args); // from axpu_regs.h
printk("ldd:writing %x at %px\n",cmdx,vaddr + REG_IOCTL_CMD/4); // <== line 248. not ok w/o this printk line why?..
*(vaddr + REG_IOCTL_CMD/4) = cmdx; // this command is different from ioctl cmd!
put_page(pages); //page_cache_release(page);
break;
case ...
I have marked line 248 in above code. If I comment out the printk there, a trap occurs and the virtual machine collapses(I'm doing this on a qemu virtual machine). The cmdx is a integer value set according to the ioctl command from the app, and vaddr is the virtual address of the device (obtained from ioremap). If I keep the printk, it works as I expect. What case can make this happen? (cache or tlb?)
Accessing memory-mapped registers by simple C constructs such as *(vaddr + REG_IOCTL_ARG/4) is a bad idea. You might get away with it on some platforms if the access is volatile-qualified, but it won't work reliably or at all on some platforms. The proper way to access memory-mapped registers is via the functions declared by #include <asm/io.h> or #include <linux/io.h>. These will take care of any arch-specific requirements to ensure that writes are properly ordered as far as the CPU is concerned1.
The functions for memory-mapped register access are described in the Linux kernel documentation under Bus-Independent Device Accesses.
This code:
*(vaddr + REG_IOCTL_ARG/4) = virt_to_phys(args);
*(vaddr + REG_IOCTL_CMD/4) = cmdx;
can be rewritten as:
writel(virt_to_phys(args), vaddr + REG_IOCTL_ARG/4);
writel(cmdx, vaddr + REG_IOCTL_CMD/4);
1 Write-ordering for specific bus types such as PCI may need extra code to read a register inbetween writes to different registers if the ordering of the register writes is important. That is because writes are "posted" asynchronously to the PCI bus, and the PCI device may process writes to different registers out of order. An intermediate register read will not be handled by the device until all preceding writes have been handled, so it can be used to enforce ordering of posted writes.

Trap memory accesses inside a standard executable built with MinGW

So my problem sounds like this.
I have some platform dependent code (embedded system) which writes to some MMIO locations that are hardcoded at specific addresses.
I compile this code with some management code inside a standard executable (mainly for testing) but also for simulation (because it takes longer to find basic bugs inside the actual HW platform).
To alleviate the hardcoded pointers, i just redefine them to some variables inside the memory pool. And this works really well.
The problem is that there is specific hardware behavior on some of the MMIO locations (w1c for example) which makes "correct" testing hard to impossible.
These are the solutions i thought of:
1 - Somehow redefine the accesses to those registers and try to insert some immediate function to simulate the dynamic behavior. This is not really usable since there are various ways to write to the MMIO locations (pointers and stuff).
2 - Somehow leave the addresses hardcoded and trap the illegal access through a seg fault, find the location that triggered, extract exactly where the access was made, handle and return. I am not really sure how this would work (and even if it's possible).
3 - Use some sort of emulation. This will surely work, but it will void the whole purpose of running fast and native on a standard computer.
4 - Virtualization ?? Probably will take a lot of time to implement. Not really sure if the gain is justifiable.
Does anyone have any idea if this can be accomplished without going too deep? Maybe is there a way to manipulate the compiler in some way to define a memory area for which every access will generate a callback. Not really an expert in x86/gcc stuff.
Edit: It seems that it's not really possible to do this in a platform independent way, and since it will be only windows, i will use the available API (which seems to work as expected). Found this Q here:
Is set single step trap available on win 7?
I will put the whole "simulated" register file inside a number of pages, guard them, and trigger a callback from which i will extract all the necessary info, do my stuff then continue execution.
Thanks all for responding.
I think #2 is the best approach. I routinely use approach #4, but I use it to test code that is running in the kernel, so I need a layer below the kernel to trap and emulate the accesses. Since you have already put your code into a user-mode application, #2 should be simpler.
The answers to this question may provide help in implementing #2. How to write a signal handler to catch SIGSEGV?
What you really want to do, though, is to emulate the memory access and then have the segv handler return to the instruction after the access. This sample code works on Linux. I'm not sure if the behavior it is taking advantage of is undefined, though.
#include <stdint.h>
#include <stdio.h>
#include <signal.h>
#define REG_ADDR ((volatile uint32_t *)0x12340000f000ULL)
static uint32_t read_reg(volatile uint32_t *reg_addr)
{
uint32_t r;
asm("mov (%1), %0" : "=a"(r) : "r"(reg_addr));
return r;
}
static void segv_handler(int, siginfo_t *, void *);
int main()
{
struct sigaction action = { 0, };
action.sa_sigaction = segv_handler;
action.sa_flags = SA_SIGINFO;
sigaction(SIGSEGV, &action, NULL);
// force sigsegv
uint32_t a = read_reg(REG_ADDR);
printf("after segv, a = %d\n", a);
return 0;
}
static void segv_handler(int, siginfo_t *info, void *ucontext_arg)
{
ucontext_t *ucontext = static_cast<ucontext_t *>(ucontext_arg);
ucontext->uc_mcontext.gregs[REG_RAX] = 1234;
ucontext->uc_mcontext.gregs[REG_RIP] += 2;
}
The code to read the register is written in assembly to ensure that both the destination register and the length of the instruction are known.
This is how the Windows version of prl's answer could look like:
#include <stdint.h>
#include <stdio.h>
#include <windows.h>
#define REG_ADDR ((volatile uint32_t *)0x12340000f000ULL)
static uint32_t read_reg(volatile uint32_t *reg_addr)
{
uint32_t r;
asm("mov (%1), %0" : "=a"(r) : "r"(reg_addr));
return r;
}
static LONG WINAPI segv_handler(EXCEPTION_POINTERS *);
int main()
{
SetUnhandledExceptionFilter(segv_handler);
// force sigsegv
uint32_t a = read_reg(REG_ADDR);
printf("after segv, a = %d\n", a);
return 0;
}
static LONG WINAPI segv_handler(EXCEPTION_POINTERS *ep)
{
// only handle read access violation of REG_ADDR
if (ep->ExceptionRecord->ExceptionCode != EXCEPTION_ACCESS_VIOLATION ||
ep->ExceptionRecord->ExceptionInformation[0] != 0 ||
ep->ExceptionRecord->ExceptionInformation[1] != (ULONG_PTR)REG_ADDR)
return EXCEPTION_CONTINUE_SEARCH;
ep->ContextRecord->Rax = 1234;
ep->ContextRecord->Rip += 2;
return EXCEPTION_CONTINUE_EXECUTION;
}
So, the solution (code snippet) is as follows:
First of all, i have a variable:
__attribute__ ((aligned (4096))) int g_test;
Second, inside my main function, i do the following:
AddVectoredExceptionHandler(1, VectoredHandler);
DWORD old;
VirtualProtect(&g_test, 4096, PAGE_READWRITE | PAGE_GUARD, &old);
The handler looks like this:
LONG WINAPI VectoredHandler(struct _EXCEPTION_POINTERS *ExceptionInfo)
{
static DWORD last_addr;
if (ExceptionInfo->ExceptionRecord->ExceptionCode == STATUS_GUARD_PAGE_VIOLATION) {
last_addr = ExceptionInfo->ExceptionRecord->ExceptionInformation[1];
ExceptionInfo->ContextRecord->EFlags |= 0x100; /* Single step to trigger the next one */
return EXCEPTION_CONTINUE_EXECUTION;
}
if (ExceptionInfo->ExceptionRecord->ExceptionCode == STATUS_SINGLE_STEP) {
DWORD old;
VirtualProtect((PVOID)(last_addr & ~PAGE_MASK), 4096, PAGE_READWRITE | PAGE_GUARD, &old);
return EXCEPTION_CONTINUE_EXECUTION;
}
return EXCEPTION_CONTINUE_SEARCH;
}
This is only a basic skeleton for the functionality. Basically I guard the page on which the variable resides, i have some linked lists in which i hold pointers to the function and values for the address in question. I check that the fault generating address is inside my list then i trigger the callback.
On first guard hit, the page protection will be disabled by the system, but i can call my PRE_WRITE callback where i can save the variable state. Because a single step is issued through the EFlags, it will be followed immediately by a single step exception (which means that the variable was written), and i can trigger a WRITE callback. All the data required for the operation is contained inside the ExceptionInformation array.
When someone tries to write to that variable:
*(int *)&g_test = 1;
A PRE_WRITE followed by a WRITE will be triggered,
When i do:
int x = *(int *)&g_test;
A READ will be issued.
In this way i can manipulate the data flow in a way that does not require modifications of the original source code.
Note: This is intended to be used as part of a test framework and any penalty hit is deemed acceptable.
For example, W1C (Write 1 to clear) operation can be accomplished:
void MYREG_hook(reg_cbk_t type)
{
/** We need to save the pre-write state
* This is safe since we are assured to be called with
* both PRE_WRITE and WRITE in the correct order
*/
static int pre;
switch (type) {
case REG_READ: /* Called pre-read */
break;
case REG_PRE_WRITE: /* Called pre-write */
pre = g_test;
break;
case REG_WRITE: /* Called after write */
g_test = pre & ~g_test; /* W1C */
break;
default:
break;
}
}
This was possible also with seg-faults on illegal addresses, but i had to issue one for each R/W, and keep track of a "virtual register file" so a bigger penalty hit. In this way i can only guard specific areas of memory or none, depending on the registered monitors.

How to send signal from kernel to user space

My kernel module code needs to send signal [def.] to a user land program, to transfer its execution to registered signal handler.
I know how to send signal between two user land processes, but I can not find any example online regarding the said task.
To be specific, my intended task might require an interface like below (once error != 1, code line int a=10 should not be executed):
void __init m_start(){
...
if(error){
send_signal_to_userland_process(SIGILL)
}
int a = 10;
...
}
module_init(m_start())
An example I used in the past to send signal to user space from hardware interrupt in kernel space. That was just as follows:
KERNEL SPACE
#include <asm/siginfo.h> //siginfo
#include <linux/rcupdate.h> //rcu_read_lock
#include <linux/sched.h> //find_task_by_pid_type
static int pid; // Stores application PID in user space
#define SIG_TEST 44
Some "includes" and definitions are needed. Basically, you need the PID of the application in user space.
struct siginfo info;
struct task_struct *t;
memset(&info, 0, sizeof(struct siginfo));
info.si_signo = SIG_TEST;
// This is bit of a trickery: SI_QUEUE is normally used by sigqueue from user space, and kernel space should use SI_KERNEL.
// But if SI_KERNEL is used the real_time data is not delivered to the user space signal handler function. */
info.si_code = SI_QUEUE;
// real time signals may have 32 bits of data.
info.si_int = 1234; // Any value you want to send
rcu_read_lock();
// find the task with that pid
t = pid_task(find_pid_ns(pid, &init_pid_ns), PIDTYPE_PID);
if (t != NULL) {
rcu_read_unlock();
if (send_sig_info(SIG_TEST, &info, t) < 0) // send signal
printk("send_sig_info error\n");
} else {
printk("pid_task error\n");
rcu_read_unlock();
//return -ENODEV;
}
The previous code prepare the signal structure and send it. Bear in mind that you need the application's PID. In my case the application from user space send its PID through ioctl driver procedure:
static long dev_ioctl(struct file *file, unsigned int cmd, unsigned long arg) {
ioctl_arg_t args;
switch (cmd) {
case IOCTL_SET_VARIABLES:
if (copy_from_user(&args, (ioctl_arg_t *)arg, sizeof(ioctl_arg_t))) return -EACCES;
pid = args.pid;
break;
USER SPACE
Define and implement the callback function:
#define SIG_TEST 44
void signalFunction(int n, siginfo_t *info, void *unused) {
printf("received value %d\n", info->si_int);
}
In main procedure:
int fd = open("/dev/YourModule", O_RDWR);
if (fd < 0) return -1;
args.pid = getpid();
ioctl(fd, IOCTL_SET_VARIABLES, &args); // send the our PID as argument
struct sigaction sig;
sig.sa_sigaction = signalFunction; // Callback function
sig.sa_flags = SA_SIGINFO;
sigaction(SIG_TEST, &sig, NULL);
I hope it helps, despite the fact the answer is a bit long, but it is easy to understand.
You can use, e.g., kill_pid(declared in <linux/sched.h>) for send signal to the specified process. To form parameters to it, see implementation of sys_kill (defined as SYSCALL_DEFINE2(kill) in kernel/signal.c).
Note, that it is almost useless to send signal from the kernel to the current process: kernel code should return before user-space program ever sees signal fired.
Your interface is violating the spirit of Linux. Don't do that..... A system call (in particular those related to your driver) should only fail with errno (see syscalls(2)...); consider eventfd(2) or netlink(7) for such asynchronous kernel <-> userland communications (and expect user code to be able to poll(2) them).
A kernel module could fail to be loaded. I'm not familiar with the details (never coded any kernel modules) but this hello2.c example suggests that the module init function can return a non zero error code on failure.
People are really expecting that signals (which is a difficult and painful concept) are behaving as documented in signal(7) and what you want to do does not fit in that picture. So a well behaved kernel module should never asynchronously send any signal to processes.
If your kernel module is not behaving nicely your users would be pissed off and won't use it.
If you want to fork your experimental kernel (e.g. for research purposes), don't expect it to be used a lot; only then could you realistically break signal behavior like you intend to do, and you could code things which don't fit into the kernel module picture (e.g. add a new syscall). See also kernelnewbies.

asio implicit strand and data synchronization

When I read asio source code, I am curious about how asio making data synchronized between threads even a implicit strand was made. These are code in asio:
io_service::run
mutex::scoped_lock lock(mutex_);
std::size_t n = 0;
for (; do_run_one(lock, this_thread, ec); lock.lock())
if (n != (std::numeric_limits<std::size_t>::max)())
++n;
return n;
io_service::do_run_one
while (!stopped_)
{
if (!op_queue_.empty())
{
// Prepare to execute first handler from queue.
operation* o = op_queue_.front();
op_queue_.pop();
bool more_handlers = (!op_queue_.empty());
if (o == &task_operation_)
{
task_interrupted_ = more_handlers;
if (more_handlers && !one_thread_)
{
if (!wake_one_idle_thread_and_unlock(lock))
lock.unlock();
}
else
lock.unlock();
task_cleanup on_exit = { this, &lock, &this_thread };
(void)on_exit;
// Run the task. May throw an exception. Only block if the operation
// queue is empty and we're not polling, otherwise we want to return
// as soon as possible.
task_->run(!more_handlers, this_thread.private_op_queue);
}
else
{
std::size_t task_result = o->task_result_;
if (more_handlers && !one_thread_)
wake_one_thread_and_unlock(lock);
else
lock.unlock();
// Ensure the count of outstanding work is decremented on block exit.
work_cleanup on_exit = { this, &lock, &this_thread };
(void)on_exit;
// Complete the operation. May throw an exception. Deletes the object.
o->complete(*this, ec, task_result);
return 1;
}
}
in its do_run_one, the unlock of mutex are all before execute handler. If there is a implicit strand, handler will not executed concurrent, but the problem is: thread A run a handler which modify data, and thread B run next handler which read the data which had been modified by thread A. Without the protect of mutex, how thread B seen the changes of data made by thread A? The mutex unlocking ahead of handler execution doesn't make a happen-before relationship between threads access the data which handler accessed.
When I go further, the handler execution use a thing called fenced_block:
completion_handler* h(static_cast<completion_handler*>(base));
ptr p = { boost::addressof(h->handler_), h, h };
BOOST_ASIO_HANDLER_COMPLETION((h));
// Make a copy of the handler so that the memory can be deallocated before
// the upcall is made. Even if we're not about to make an upcall, a
// sub-object of the handler may be the true owner of the memory associated
// with the handler. Consequently, a local copy of the handler is required
// to ensure that any owning sub-object remains valid until after we have
// deallocated the memory here.
Handler handler(BOOST_ASIO_MOVE_CAST(Handler)(h->handler_));
p.h = boost::addressof(handler);
p.reset();
// Make the upcall if required.
if (owner)
{
fenced_block b(fenced_block::half);
BOOST_ASIO_HANDLER_INVOCATION_BEGIN(());
boost_asio_handler_invoke_helpers::invoke(handler, handler);
BOOST_ASIO_HANDLER_INVOCATION_END;
}
what is this? I know fence seems a sync primitive which supported by C++11 but this fence is totally writen by asio itself. Does this fenced_block help to do the job of data synchronization?
UPDATED
After I google and read this and this, asio indeed use memory fence primitive to synchronize data in threads, that is more faster than unlock till the handler execute complete(speed difference on x86). In fact Java volatile keyword is implemented by insert memory barrier after write & before read this variable to make happen-before relationship.
If someone could simply describe asio memory fence implemenation or add something I missed or misunderstand, I will accept it.
Before the operation invokes the user handler, Boost.Asio uses a memory fence to provide the appropriate memory reordering without forcing mutual execution of handler execution. Thus, thread B would observe changes to memory that occurred within the context of thread A.
C++03 did not specify requirements for memory visibility with regards to multi-threaded execution. However, C++11 defines these requirements in ยง 1.10 Multi-threaded executions and data races, as well as the Atomic operations and Thread support library sections. Boost and C++11 mutexes do perform the appropriate memory reordering. For other implementations, it is worth checking the mutex library's documentation to verify memory reordering occurs.
Boost.Asio memory fences are an implementation detail, and thus always subject to change. Boost.Asio abstracts itself from the architecture/compiler specific implementations through a series of conditional defines within asio/detail/fenced_block.hpp where only a single memory barrier implementation is included. The underlying implementation is contained within a class, for which a fenced_block alias is created via a typedef.
Here is a relevant excerpt:
#elif defined(__GNUC__) && (defined(__hppa) || defined(__hppa__))
# include "asio/detail/gcc_hppa_fenced_block.hpp"
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
# include "asio/detail/gcc_x86_fenced_block.hpp"
#elif ...
...
namespace asio {
namespace detail {
...
#elif defined(__GNUC__) && (defined(__hppa) || defined(__hppa__))
typedef gcc_hppa_fenced_block fenced_block;
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
typedef gcc_x86_fenced_block fenced_block;
#elif ...
...
} // namespace detail
} // namespace asio
The implementations of the the memory barriers are specific to the architecture and compilers. Boost.Asio has a family of asio/detail/*_fenced_blocked.hpp header files. For example, the win_fenced_block uses InterlockedExchange for Borland; otherwise it uses the xchg assembly instruction, which has an implicit lock prefix when used with a memory address. For gcc_x86_fenced_block, Boost.Asio uses the memory assembly instruction.
If you find yourself needing to use a fence, then consider the Boost.Atomic library. Introduced in Boost 1.53, Boost.Atomic provides an implementation of thread and signal fences based the C++11 standard. Boost.Asio has been using its own implementation of memory fences prior to the Boost.Atomic being added to Boost. Also, the Boost.Asio fences are scoped based. fenced_block will perform an acquire in its constructor, and a release in its destructor.

Almost lockless producer consumer

I have producer consumer problem to solve with slight modification - there are many parallel producers but only one consumer in one parallel thread. When producer has no place in buffer then it simply ignore element (without waiting on consumer). I have writtent some C pseudocode:
struct Element
{
ULONG content;
volatile LONG bNew;
}
ULONG max_count = 10;
Element buffer* = calloc(max_count, sizeof(Element));
volatile LONG producer_idx = 0;
LONG consumer_idx = 0;
EVENT NotEmpty;
BOOLEAN produce(ULONG content)
{
LONG idx = InterlockedIncrement(&consumer_idx) % max_count;
if(buffer[idx].bNew)
return FALSE;
buffer[idx].content = content;
buffer[idx].bNew = TRUE;
SetEvent(NotEmpty);
return TRUE;
}
void consume_thread()
{
while(TRUE)
{
Wait(NotEmpty);
while(buffer[consumer_idx].bNew)
{
ULONG content = buffer[consumer_idx].content;
InterlockedExchange(&buffer[consumer_idx].bNew, FALSE);
//Simple mechanism for preventing producer_idx overflow
LONG tmp = producer_idx;
InterlockedCompareExchange(&producer_idx, tmp%maxcount, tmp);
consumer_idx = (consumer_idx+1)%max_count;
doSth(content);
}
}
}
I am no 100% sure that this code is correct. Can you see any problems that could occur? Or maybe this code could be written in better way?
Don't use global variable to accomplish your goal, especially in multithreaded application!!! Use Semaphore instead and don't do Lock but TryLock instead. If TryLock fails it means there's no room for another element, so you can skip it.
Here you can find something to read about semaphores in WinAPI because you would probably use it:
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686946(v=vs.85).aspx
http://msdn.microsoft.com/en-us/library/windows/desktop/ms687032(v=vs.85).aspx
You can achieve TryLock functionality by passing 0 as timeout to WaitForSingleObject function.
Please, read this: http://en.wikipedia.org/wiki/Memory_barrier
The C and C++ standards do not address multiple threads (or multiple
processors), and as such, the usefulness of volatile depends on the
compiler and hardware. Although volatile guarantees that the volatile
reads and volatile writes will happen in the exact order specified in
the source code, the compiler may generate code (or the CPU may
re-order execution) such that a volatile read or write is reordered
with regard to non-volatile reads or writes, thus limiting its
usefulness as an inter-thread flag or mutex. Moreover, it is not
guaranteed that volatile reads and writes will be seen in the same
order by other processors due to caching, cache coherence protocol and
relaxed memory ordering, meaning volatile variables alone may not even
work as inter-thread flags or mutexes.
So in common case just volatile won't work for C.
But it could work for some specific compilers/hardware and other languages (Java 5, for example).
See also Is function call a memory barrier?

Resources