When MPI_Send doesn't block - amazon-ec2

I am using some code that implements a manual MPI broadcast, basically a demo that unicasts an integer from the root to all other nodes. Of course, unicasting to many nodes is less efficient than MPI_Bcast(), but I just want to check how things work.
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

void my_bcast(void* data, int count, MPI::Datatype datatype, int root, MPI::Intracomm communicator) {
    int world_size = communicator.Get_size();
    int world_rank = communicator.Get_rank();
    if (world_rank == root) {
        // If we are the root process, send our data to everyone
        int i;
        for (i = 0; i < world_size; i++) {
            if (i != world_rank) {
                communicator.Send(data, count, datatype, i, 0);
            }
        }
    } else {
        // If we are a receiver process, receive the data from the root
        communicator.Recv(data, count, datatype, root, 0);
    }
}

int main(int argc, char** argv) {
    MPI::Init();
    int world_rank = MPI::COMM_WORLD.Get_rank();
    int data;
    if (world_rank == 0) {
        data = 100;
        printf("Process 0 broadcasting data %d\n", data);
        my_bcast(&data, 1, MPI::INT, 0, MPI::COMM_WORLD);
    } else {
        my_bcast(&data, 1, MPI::INT, 0, MPI::COMM_WORLD);
        printf("Process %d received data %d from root process\n", world_rank, data);
    }
    MPI::Finalize();
}
What I noticed is that if I remove the check that the root doesn't send to itself,
if (i != world_rank) {
...
}
the program still works and doesn't block, whereas the default behavior of MPI_Send() is supposed to be blocking, i.e. to wait until the data has been received at the other end. Yet MPI_Recv() is never invoked by the root. Can someone explain why this is happening?
I run the code from the root node with the following command (the cluster is set up on Amazon EC2, uses NFS as shared storage among the nodes, and all machines have Open MPI 1.10.2 installed):
mpirun -mca btl ^openib -mca plm_rsh_no_tree_spawn 1 /EC2_NFS/my_bcast
The C file is compiled with
mpic++ my_bcast.c
and the mpic++ version is 5.4.0.
The code is taken from www.mpitutorial.com

You are mistaking blocking for synchronous behaviour. Blocking means that the call does not return until the operation has completed. The standard send operation (MPI_Send) completes once the supplied buffer is free to be reused by the program. This means either that the message is fully in transit to the receiver or that it was stored internally by the MPI library for later delivery (a buffered send). The buffering behaviour is implementation-specific, but most libraries will buffer a message the size of a single integer. If you force synchronous mode by using MPI_Ssend (or the C++ equivalent), your program will hang.
Please note that the C++ MPI bindings are no longer part of the standard and should not be used in the development of new software. Use the C bindings MPI_Blabla instead.
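To illustrate, here is a rough sketch of mine of the question's send loop using the C bindings (my_bcast_c is a made-up name, not code from the question). With MPI_Send the root's send to itself completes because the small message gets buffered; with MPI_Ssend the root waits for a matching receive, so the send-to-self hangs:
#include <mpi.h>

/* Sketch only: mirrors the question's my_bcast with the C bindings.
 * The "i != world_rank" check is deliberately omitted to reproduce
 * the scenario described above. */
void my_bcast_c(void* data, int count, MPI_Datatype datatype, int root, MPI_Comm comm) {
    int world_size, world_rank;
    MPI_Comm_size(comm, &world_size);
    MPI_Comm_rank(comm, &world_rank);

    if (world_rank == root) {
        for (int i = 0; i < world_size; i++) {
            /* MPI_Send would typically return immediately here (buffered);
             * MPI_Ssend waits for a matching receive, so the send-to-self hangs. */
            MPI_Ssend(data, count, datatype, i, 0, comm);
        }
    } else {
        MPI_Recv(data, count, datatype, root, 0, comm, MPI_STATUS_IGNORE);
    }
}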

boost asio: Is it thread safe to call tcp::socket::async_read_some() when handler is protected by a strand

I'm struggling to fully understand Boost ASIO and strands. I was under the impression that a call to socket::async_read_some() was safe as long as the handler was wrapped in a strand. This appears not to be the case, since the code eventually throws an exception.
In my situation a third-party library is making the Session::readSome() calls. I'm using a reactor pattern with the ASIO layer under the third-party library. When data arrives on the socket, the third party is called to do the read. The pattern is used since it is necessary to abort the read operation at any time and have the third-party library error out and return its thread. The third party expects a blocking read, so the code mimics it with a condition variable.
Given the example below what is the proper way to do this? Do I need to wrap the async_read_some() call in a dispatch() or post() so it runs through a strand too?
Note: Compiler is c++14 ;-(
Example representative code:
Session::Session(ba::io_context& ioContext) :
    m_sessionStrand(ioContext.get_executor()),
    m_socket(m_sessionStrand)
{}

int32_t Session::readSome(unsigned char* pBuffer, uint32_t bufferSizeToRead, boost::system::error_code& errorCode)
{
    // The 3rd party expects a synchronous read, so we mimic the behavior
    // with an async_read and then wait for the results. With this pattern
    // we can unblock the read elsewhere - for example, by calling close on
    // the socket - and still give the 3rd party the illusion of a
    // synchronous read. In such a case the 3rd party will receive an error
    // code on the read and return its thread.

    // Nothing to do
    if (bufferSizeToRead == 0) return 0;

    // Create a mutable buffer
    ba::mutable_buffer buffer(pBuffer, bufferSizeToRead);
    std::size_t result = 0;
    errorCode.clear();

    // Setup conditional
    m_readerPause.exchange(true);

    auto readHandler = [&result, &errorCode, self=shared_from_this()](boost::system::error_code ec, std::size_t bytesRead)
    {
        result = bytesRead;
        errorCode = ec;

        // Signal that we got results
        std::unique_lock<std::mutex> lock{m_readerMutex};
        m_readerPause.exchange(false);
        m_readerPauseCV.notify_all();
    };

    m_socket.async_read_some(buffer, ba::bind_executor(m_sessionStrand, readHandler));

    // We pause the 3rd party read thread until we get the read results back - or an error occurs
    {
        std::unique_lock<std::mutex> lock{m_readerMutex};
        m_readerPauseCV.wait(lock, [this]{ return !m_readerPause.load(std::memory_order_acquire); });
    }

    return result;
}
The exception occurs in epoll_reactor.ipp. There is a race condition between the read and closing the socket.
void epoll_reactor::start_op(int op_type, socket_type descriptor,
    epoll_reactor::per_descriptor_data& descriptor_data, reactor_op* op,
    bool is_continuation, bool allow_speculative)
{
  if (!descriptor_data)
  {
    op->ec_ = boost::asio::error::bad_descriptor;
    post_immediate_completion(op, is_continuation);
    return;
  }

  mutex::scoped_lock descriptor_lock(descriptor_data->mutex_);

  if (descriptor_data->shutdown_) //!! SegFault here: descriptor_data == NULL
  {
    post_immediate_completion(op, is_continuation);
    return;
  }
  ...
}
Thanks in advance for any insights in the proper way to handle this situation using ASIO.
The strand doesn't "protect" the handler. Instead, it protects some shared state (which you control) by synchronizing handler execution. It's exactly like a mutex for async execution.
By this logic, all code running on the strand can touch the shared resources; conversely, code not guaranteed to be on the strand must not touch them.
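As a minimal illustration of that analogy (this sketch is not from your code; the timers and the counter are just stand-ins): two completion handlers can both touch a shared counter without a mutex, as long as both are bound to the same strand:
#include <boost/asio.hpp>
#include <chrono>
#include <iostream>

int main() {
    boost::asio::thread_pool pool(2);  // handlers may run on two threads
    auto strand = boost::asio::make_strand(pool.get_executor());

    int counter = 0;  // shared state, touched only from the strand

    boost::asio::steady_timer t1(pool, std::chrono::milliseconds(1));
    boost::asio::steady_timer t2(pool, std::chrono::milliseconds(1));

    // bind_executor(strand, ...) serializes the two handlers, like a mutex would
    t1.async_wait(boost::asio::bind_executor(strand, [&](boost::system::error_code) { ++counter; }));
    t2.async_wait(boost::asio::bind_executor(strand, [&](boost::system::error_code) { ++counter; }));

    pool.join();
    std::cout << counter << "\n";  // always 2, no data race
}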
In your code, the shared resources consist of at least buffer, result, and m_socket. It would be more complete to include m_sessionStrand, m_readerPauseCV, m_readerMutex, and m_readerPause, but all of these are implicitly thread-safe the way they are used¹.
Your code appears to do things safely in these regards. However, it makes a few unfortunate detours that make it harder than necessary to check and reason about the code:
it uses more (local) shared state to communicate results from the handler
it doesn't make explicit what the mutex and/or the strand protect
it employs both a mutex and a strand which conceptually compete for the same responsibility
it employs both a condition and an atomic bool, which again compete for the same responsibility
it does manual strand binding, which muddies what the native executor of the m_socket object is expected to be²
the initial read is not protected. This means that if Session::readSome is invoked from a "wild" thread, it will use member functions without synchronizing with any other operations that may be pending on m_socket.
the atomic_bool mutations are spelled in Very Convoluted Ways(TM), which serve to show you (presumably) understand the memory model, but they make the code harder to review without tangible merit. Clearly, the blocking synchronization will (far) outweigh any benefit of explicit memory acquisition order. I suggest at least "normalizing" the spelling, as atomic_bool was explicitly designed to afford:
//m_readerPause.exchange(true);
m_readerPause = true;
and
m_readerPauseCV.wait(lock, [this] { return !m_readerPause; });
since you are emulating blocking IO, there is no merit in capturing shared_from_this() in the lambda. Lifetime should be guaranteed by the calling party anyway.
Interestingly, you didn't show a this capture, which is required for the lambda to compile (the handler touches members such as m_readerMutex), assuming you didn't use global variables.
Kudos for explicitly clearing the error_code output variable. This is oft forgotten. Technically, you did forget it on the (questionable?) early exit when (bufferSizeToRead == 0)... You might have a slightly unorthodox caller contract where this makes sense.
To be generic, I'd suggest performing the zero-length read, as it might behave differently depending on the transport connected.
Last, but not least, m_socket.[async_]read_some is rarely what you require at the application protocol level. I'll leave this one to you, as you might have this exceptional edge-case scenario.
Simplifying
Conceptually, I'd like to write:
int32_t Session::readSome(unsigned char* buf, uint32_t size, error_code& ec) {
    ec.clear();
    size_t result = 0;
    std::tie(ec, result) = m_socket
                               .async_read_some(ba::buffer(buf, size),
                                                ba::as_tuple(ba::use_future))
                               .get();
    return result;
}
This uses futures to get the blocking behaviour while remaining cancelable. Sadly, contrary to expectation, there is currently a limitation that prevents combining as_tuple and use_future.
So, we have to either ignore partial success scenarios (significant result when !ec):
int32_t Session::readSome(unsigned char* buf, uint32_t size, error_code& ec) try {
    ec.clear();
    return m_socket
        .async_read_some(ba::buffer(buf, size), ba::use_future)
        .get();
} catch (boost::system::system_error const& se) {
    ec = se.code();
    return 0;
}
I suspect that member-async_read_some doesn't have a partial success mode. However, let's still give it thought, seeing that I warned before that async_read_some is rarely what you need anyways:
int32_t Session::readSome(unsigned char* buf, uint32_t size, error_code& ec) {
    std::promise<std::tuple<size_t, error_code>> p;
    m_socket.async_read_some(ba::buffer(buf, size),
                             [&p](error_code ec_, size_t n_) { p.set_value({n_, ec_}); });
    size_t result;
    std::tie(result, ec) = p.get_future().get();
    return result;
}
Still considerably easier.
Interim Result
Self contained example with the current approach:
Live On Coliru
#include <boost/asio.hpp>
namespace ba = boost::asio;
using ba::ip::tcp;
using boost::system::error_code;
using CharT = /*unsigned*/ char; // for ease of output...
struct Session : std::enable_shared_from_this<Session> {
    tcp::socket m_socket;

    Session(ba::any_io_executor ex) : m_socket(make_strand(ex)) {
        m_socket.connect({{}, 7878});
    }

    int32_t readSome(CharT* buf, uint32_t size, error_code& ec) {
        std::promise<std::tuple<size_t, error_code>> p;
        m_socket.async_read_some(ba::buffer(buf, size), [&p](error_code ec_, size_t n_) {
            p.set_value({n_, ec_});
        });
        size_t result;
        std::tie(result, ec) = p.get_future().get();
        return result;
    }
};

#include <iomanip>
#include <iostream>
int main() {
    ba::thread_pool ioc;
    auto s = std::make_shared<Session>(ioc.get_executor());

    error_code ec;
    CharT data[10];

    while (auto n = s->readSome(data, 10, ec))
        std::cout << "Received " << quoted(std::string(data, n)) << " (" << ec.message() << ")\n";

    ioc.join();
}
Testing with
g++ -std=c++14 -O2 -Wall -pedantic -pthread main.cpp
for resp in FOO LONG_BAR_QUX_RESPONSE; do nc -tln 7878 -w 0 <<< $resp; done&
set -x
sleep .2; ./a.out
sleep .2; ./a.out
Prints
+ sleep .2
+ ./a.out
Received "FOO
" (Success)
+ sleep .2
+ ./a.out
Received "LONG_BAR_Q" (Success)
Received "UX_RESPONS" (Success)
Received "E
" (Success)
External Synchronization (Cancellation?)
Now, code not shown implies that other operations may act on m_socket, if only to cancel operations in flight³. If this situation arises, you have to add the missing synchronization, using either the mutex or the strand.
I suggest not introducing a competing synchronization mechanism (i.e., stick with the strand), even though the mutex wouldn't be "incorrect". This will
lead to simpler code
allow you to solidify your understanding of the use of the strand.
So, let's make sure that the operation runs on the strand:
int32_t readSome(CharT* buf, uint32_t size, error_code& ec) {
    std::promise<size_t> p;
    post(m_socket.get_executor(), [&] {
        m_socket.async_read_some(ba::buffer(buf, size),
                                 [&](error_code ec_, size_t n_) { ec = ec_; p.set_value(n_); });
    });
    return p.get_future().get();
}

void cancel() {
    post(m_socket.get_executor(),
         [self = shared_from_this()] { self->m_socket.cancel(); });
}
See it Live On Coliru
Exercising Cancellation
int main() {
    ba::thread_pool ioc(1);
    auto s = std::make_shared<Session>(ioc.get_executor());

    std::thread th([&] {
        std::this_thread::sleep_for(5s);
        s->cancel();
    });

    error_code ec;
    CharT data[10];

    do {
        auto n = s->readSome(data, 10, ec);
        std::cout << "Received " << quoted(std::string(data, n)) << " (" << ec.message() << ")\n";
    } while (!ec);

    ioc.join();
    th.join();
}
Again, Live On Coliru
¹ Technically in a multi-thread situation you need to notify the CV under the lock to allow for fair scheduling, i.e. to prevent waiter starvation. However your scenario is so isolated that you can get away with being somewhat sloppy.
² by default tcp::socket type-erases the executor with any_io_executor, but you could use basic_stream_socket<tcp, strand<io_context::executor_type> > to remove that cost if your executor type is statically known
³ Of course, POSIX sockets include full duplex scenarios, where read and write operations can be in flight simultaneously.
UPDATE: redirect_error
Just re-discovered redirect_error which allows something close to as_tuple:
auto readSome(CharT* buf, uint32_t size, error_code& ec) {
    return m_socket
        .async_read_some(ba::buffer(buf, size),
                         ba::redirect_error(ba::use_future, ec))
        .get();
}

void cancel() { m_socket.cancel(); }
This only suffices when readSome and cancel are guaranteed to be invoked on the strand.

Trap memory accesses inside a standard executable built with MinGW

So my problem is this.
I have some platform-dependent code (embedded system) which writes to some MMIO locations that are hardcoded at specific addresses.
I compile this code together with some management code into a standard executable, mainly for testing but also for simulation (because it takes longer to find basic bugs on the actual HW platform).
To deal with the hardcoded pointers, I just redefine them to some variables inside the memory pool, and this works really well.
The problem is that some of the MMIO locations have specific hardware behavior (W1C, for example), which makes "correct" testing hard to impossible.
These are the solutions I thought of:
1 - Somehow redefine the accesses to those registers and try to insert some intermediate function to simulate the dynamic behavior. This is not really usable since there are various ways to write to the MMIO locations (pointers and such).
2 - Somehow leave the addresses hardcoded and trap the illegal access through a seg fault, find the location that triggered it, extract exactly where the access was made, handle it, and return. I am not really sure how this would work (or even if it's possible).
3 - Use some sort of emulation. This will surely work, but it would defeat the whole purpose of running fast and native on a standard computer.
4 - Virtualization?? This would probably take a lot of time to implement, and I'm not really sure the gain is justifiable.
Does anyone have any idea if this can be accomplished without going too deep? Maybe there is a way to manipulate the compiler to define a memory area for which every access generates a callback? I'm not really an expert in x86/gcc stuff.
Edit: It seems that it's not really possible to do this in a platform-independent way, and since it will be Windows only, I will use the available API (which seems to work as expected). Found this Q here:
Is set single step trap available on win 7?
I will put the whole "simulated" register file inside a number of pages, guard them, and trigger a callback from which I will extract all the necessary info, do my stuff, then continue execution.
Thanks all for responding.
I think #2 is the best approach. I routinely use approach #4, but I use it to test code that is running in the kernel, so I need a layer below the kernel to trap and emulate the accesses. Since you have already put your code into a user-mode application, #2 should be simpler.
The answers to this question may provide help in implementing #2. How to write a signal handler to catch SIGSEGV?
What you really want to do, though, is to emulate the memory access and then have the segv handler return to the instruction after the access. This sample code works on Linux. I'm not sure if the behavior it is taking advantage of is undefined, though.
#include <stdint.h>
#include <stdio.h>
#include <signal.h>

#define REG_ADDR ((volatile uint32_t *)0x12340000f000ULL)

static uint32_t read_reg(volatile uint32_t *reg_addr)
{
    uint32_t r;
    asm("mov (%1), %0" : "=a"(r) : "r"(reg_addr));
    return r;
}

static void segv_handler(int, siginfo_t *, void *);

int main()
{
    struct sigaction action = { 0, };
    action.sa_sigaction = segv_handler;
    action.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &action, NULL);

    // force sigsegv
    uint32_t a = read_reg(REG_ADDR);

    printf("after segv, a = %d\n", a);
    return 0;
}

static void segv_handler(int, siginfo_t *info, void *ucontext_arg)
{
    ucontext_t *ucontext = static_cast<ucontext_t *>(ucontext_arg);
    ucontext->uc_mcontext.gregs[REG_RAX] = 1234;
    ucontext->uc_mcontext.gregs[REG_RIP] += 2;
}
The code to read the register is written in assembly to ensure that both the destination register and the length of the instruction are known.
This is how the Windows version of prl's answer could look:
#include <stdint.h>
#include <stdio.h>
#include <windows.h>

#define REG_ADDR ((volatile uint32_t *)0x12340000f000ULL)

static uint32_t read_reg(volatile uint32_t *reg_addr)
{
    uint32_t r;
    asm("mov (%1), %0" : "=a"(r) : "r"(reg_addr));
    return r;
}

static LONG WINAPI segv_handler(EXCEPTION_POINTERS *);

int main()
{
    SetUnhandledExceptionFilter(segv_handler);

    // force sigsegv
    uint32_t a = read_reg(REG_ADDR);

    printf("after segv, a = %d\n", a);
    return 0;
}

static LONG WINAPI segv_handler(EXCEPTION_POINTERS *ep)
{
    // only handle read access violation of REG_ADDR
    if (ep->ExceptionRecord->ExceptionCode != EXCEPTION_ACCESS_VIOLATION ||
        ep->ExceptionRecord->ExceptionInformation[0] != 0 ||
        ep->ExceptionRecord->ExceptionInformation[1] != (ULONG_PTR)REG_ADDR)
        return EXCEPTION_CONTINUE_SEARCH;

    ep->ContextRecord->Rax = 1234;
    ep->ContextRecord->Rip += 2;
    return EXCEPTION_CONTINUE_EXECUTION;
}
So, the solution (code snippet) is as follows:
First of all, I have a variable:
__attribute__ ((aligned (4096))) int g_test;
Second, inside my main function, I do the following:
AddVectoredExceptionHandler(1, VectoredHandler);
DWORD old;
VirtualProtect(&g_test, 4096, PAGE_READWRITE | PAGE_GUARD, &old);
The handler looks like this:
LONG WINAPI VectoredHandler(struct _EXCEPTION_POINTERS *ExceptionInfo)
{
    static DWORD last_addr;

    if (ExceptionInfo->ExceptionRecord->ExceptionCode == STATUS_GUARD_PAGE_VIOLATION) {
        last_addr = ExceptionInfo->ExceptionRecord->ExceptionInformation[1];
        ExceptionInfo->ContextRecord->EFlags |= 0x100; /* Single step to trigger the next one */
        return EXCEPTION_CONTINUE_EXECUTION;
    }

    if (ExceptionInfo->ExceptionRecord->ExceptionCode == STATUS_SINGLE_STEP) {
        DWORD old;
        VirtualProtect((PVOID)(last_addr & ~PAGE_MASK), 4096, PAGE_READWRITE | PAGE_GUARD, &old);
        return EXCEPTION_CONTINUE_EXECUTION;
    }

    return EXCEPTION_CONTINUE_SEARCH;
}
This is only a basic skeleton of the functionality. Basically, I guard the page on which the variable resides, and I keep some linked lists in which I hold pointers to the callback functions and the values for the addresses in question. I check that the fault-generating address is inside my list and then trigger the callback.
On the first guard hit, the page protection will be disabled by the system, but I can call my PRE_WRITE callback, where I save the variable's state. Because a single step is issued through EFlags, it will be followed immediately by a single-step exception (which means the variable was written), and I can trigger a WRITE callback. All the data required for the operation is contained inside the ExceptionInformation array.
When someone tries to write to that variable:
*(int *)&g_test = 1;
A PRE_WRITE followed by a WRITE will be triggered.
When I do:
int x = *(int *)&g_test;
A READ will be issued.
In this way I can manipulate the data flow in a way that does not require modifications to the original source code.
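A rough sketch of how the monitor list and the dispatch could look (the names here are only illustrative, not the exact ones from my implementation; reg_cbk_t matches the hook example further below):
/* Illustrative only - the real structure is a linked list; names are placeholders. */
typedef enum { REG_READ, REG_PRE_WRITE, REG_WRITE } reg_cbk_t;
typedef void (*reg_hook_t)(reg_cbk_t);

struct monitor {
    ULONG_PTR  addr;   /* guarded address */
    reg_hook_t hook;   /* callback to invoke */
};

static struct monitor g_monitors[64];
static size_t g_monitor_count;

/* Called from the vectored handler with ExceptionInformation[1] */
static reg_hook_t find_hook(ULONG_PTR fault_addr)
{
    for (size_t i = 0; i < g_monitor_count; ++i)
        if (g_monitors[i].addr == fault_addr)
            return g_monitors[i].hook;
    return NULL;
}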
Note: This is intended to be used as part of a test framework and any penalty hit is deemed acceptable.
For example, a W1C (write 1 to clear) operation can be accomplished like this:
void MYREG_hook(reg_cbk_t type)
{
    /** We need to save the pre-write state.
     * This is safe since we are assured to be called with
     * both PRE_WRITE and WRITE in the correct order.
     */
    static int pre;

    switch (type) {
    case REG_READ:      /* Called pre-read */
        break;
    case REG_PRE_WRITE: /* Called pre-write */
        pre = g_test;
        break;
    case REG_WRITE:     /* Called after write */
        g_test = pre & ~g_test; /* W1C */
        break;
    default:
        break;
    }
}
This was also possible with seg-faults on illegal addresses, but then I had to take one fault for each R/W and keep track of a "virtual register file", so the penalty hit was bigger. This way I can guard only specific areas of memory, or none at all, depending on the registered monitors.

How to send signal from kernel to user space

My kernel module code needs to send a signal [def.] to a user-land program, to transfer its execution to a registered signal handler.
I know how to send a signal between two user-land processes, but I cannot find any example online of how to do this.
To be specific, my intended task might require an interface like the one below (once error is non-zero, the line int a = 10 should not be executed):
void __init m_start(){
    ...
    if(error){
        send_signal_to_userland_process(SIGILL);
    }
    int a = 10;
    ...
}
module_init(m_start);
Here is an example I used in the past to send a signal to user space from a hardware interrupt in kernel space. It went as follows:
KERNEL SPACE
#include <asm/siginfo.h> //siginfo
#include <linux/rcupdate.h> //rcu_read_lock
#include <linux/sched.h> //find_task_by_pid_type
static int pid; // Stores application PID in user space
#define SIG_TEST 44
Some "includes" and definitions are needed. Basically, you need the PID of the application in user space.
struct siginfo info;
struct task_struct *t;

memset(&info, 0, sizeof(struct siginfo));
info.si_signo = SIG_TEST;

// This is a bit of trickery: SI_QUEUE is normally used by sigqueue from user space,
// and kernel space should use SI_KERNEL. But if SI_KERNEL is used, the real-time data
// is not delivered to the user-space signal handler function.
info.si_code = SI_QUEUE;

// Real-time signals may carry 32 bits of data.
info.si_int = 1234;   // Any value you want to send

rcu_read_lock();
// find the task with that pid
t = pid_task(find_pid_ns(pid, &init_pid_ns), PIDTYPE_PID);
if (t != NULL) {
    rcu_read_unlock();
    if (send_sig_info(SIG_TEST, &info, t) < 0)   // send signal
        printk("send_sig_info error\n");
} else {
    printk("pid_task error\n");
    rcu_read_unlock();
    //return -ENODEV;
}
The previous code prepares the signal structure and sends it. Bear in mind that you need the application's PID. In my case, the user-space application sends its PID through the driver's ioctl procedure:
static long dev_ioctl(struct file *file, unsigned int cmd, unsigned long arg) {
    ioctl_arg_t args;

    switch (cmd) {
    case IOCTL_SET_VARIABLES:
        if (copy_from_user(&args, (ioctl_arg_t *)arg, sizeof(ioctl_arg_t)))
            return -EACCES;
        pid = args.pid;
        break;
    }
    return 0;
}
USER SPACE
Define and implement the callback function:
#define SIG_TEST 44

void signalFunction(int n, siginfo_t *info, void *unused) {
    printf("received value %d\n", info->si_int);
}
In the main procedure:
int fd = open("/dev/YourModule", O_RDWR);
if (fd < 0) return -1;

args.pid = getpid();
ioctl(fd, IOCTL_SET_VARIABLES, &args);   // send our PID as argument

struct sigaction sig;
memset(&sig, 0, sizeof(sig));            // also clears sa_mask
sig.sa_sigaction = signalFunction;       // Callback function
sig.sa_flags = SA_SIGINFO;
sigaction(SIG_TEST, &sig, NULL);
I hope it helps; the answer is a bit long, but it is easy to understand.
You can use, e.g., kill_pid (declared in <linux/sched.h>) to send a signal to a specified process. To see how to form its parameters, look at the implementation of sys_kill (defined as SYSCALL_DEFINE2(kill) in kernel/signal.c).
Note that it is almost useless to send a signal from the kernel to the current process: the kernel code has to return before the user-space program ever sees the signal fire.
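For illustration, a minimal sketch (the helper name is mine, and it assumes a kernel where kill_pid() and find_get_pid() are available as described above) could look like:
#include <linux/sched.h>
#include <linux/pid.h>
#include <linux/signal.h>
#include <linux/errno.h>

/* Hypothetical helper: deliver SIGILL to the process with numeric pid 'nr'. */
static int send_sigill_to(pid_t nr)
{
    struct pid *p = find_get_pid(nr);   /* takes a reference on the struct pid */
    int ret = -ESRCH;

    if (p) {
        ret = kill_pid(p, SIGILL, 1);   /* priv = 1: signal originates in the kernel */
        put_pid(p);
    }
    return ret;
}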
Your interface violates the spirit of Linux. Don't do that. A system call (in particular those related to your driver) should only fail with errno (see syscalls(2)); consider eventfd(2) or netlink(7) for such asynchronous kernel <-> userland communication (and expect user code to be able to poll(2) them).
A kernel module can also simply fail to load. I'm not familiar with the details (I never coded any kernel modules), but this hello2.c example suggests that the module init function can return a non-zero error code on failure.
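For illustration, a minimal sketch of that convention (my own example, modeled on what the hello2.c sample suggests; the error condition is a placeholder):
#include <linux/init.h>
#include <linux/module.h>

MODULE_LICENSE("GPL");

static bool error_detected = true;   /* stand-in for the real check */

static int __init m_start(void)
{
    if (error_detected)
        return -EINVAL;   /* insmod fails; the module is never loaded */
    return 0;
}

static void __exit m_end(void) { }

module_init(m_start);
module_exit(m_end);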
People really expect signals (which are a difficult and painful concept) to behave as documented in signal(7), and what you want to do does not fit that picture. So a well-behaved kernel module should never asynchronously send any signal to processes.
If your kernel module does not behave nicely, your users will be pissed off and won't use it.
If you want to fork your experimental kernel (e.g. for research purposes), don't expect it to be used a lot; only then could you realistically break signal behavior like you intend to do, and you could code things which don't fit into the kernel module picture (e.g. add a new syscall). See also kernelnewbies.

How can I synchronize data between different cores on Xeon (Linux: how to use memory barriers)

I wrote a simple program to test memory synchronization. I use a global queue shared between two processes and bind the two processes to different cores. My code is below.
#define _GNU_SOURCE  /* needed for CPU_ZERO/CPU_SET in <sched.h> */
#include <stdio.h>
#include <sched.h>
#include <unistd.h>

void bindcpu(int pid) {
    int cpuid;
    cpu_set_t mask;
    cpu_set_t get;

    CPU_ZERO(&mask);
    if (pid > 0) {
        cpuid = 1;
    } else {
        cpuid = 5;
    }
    CPU_SET(cpuid, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
        printf("warning: could not set CPU affinity, continuing...\n");
    }
}

#define Q_LENGTH 512
int g_queue[512];

struct point {
    int volatile w;
    int volatile r;
};
volatile struct point g_p;

void iwrite(int x) {
    while (g_p.r == g_p.w);
    sleep(0.1);
    g_queue[g_p.w] = x;
    g_p.w = (g_p.w + 1) % Q_LENGTH;
    printf("#%d!%d", g_p.w, g_p.r);
}

void iread(int *x) {
    while (((g_p.r + 1) % Q_LENGTH) == g_p.w);
    *x = g_queue[g_p.r];
    g_p.r = (g_p.r + 1) % Q_LENGTH;
    printf("-%d*%d", g_p.r, g_p.w);
}

int main(int argc, char * argv[]) {
    //int num = sysconf(_SC_NPROCESSORS_CONF);
    int pid;

    pid = fork();
    g_p.r = Q_LENGTH;
    bindcpu(pid);

    int i = 0, j = 0;
    if (pid > 0) {
        printf("call iwrite \0");
        while (1) {
            iread(&j);
        }
    } else {
        printf("call iread\0");
        while (1) {
            iwrite(i);
            i++;
        }
    }
}
The data between the two processes, running on two different cores, didn't get synchronized.
CPU: Intel(R) Xeon(R) CPU E3-1230
OS: 3.8.0-35-generic #50~precise1-Ubuntu SMP
I want to know how I can synchronize the data between the different cores in user space, beyond IPC.
If you want your application to manipulate the CPU's shared cache directly in order to accomplish IPC, I don't believe you will be able to do that.
Chapter 9 of "Linux Kernel Development, Second Edition" has information on synchronizing multi-threaded applications (including atomic operations, semaphores, barriers, etc.):
http://www.makelinux.net/books/lkd2/ch09
so you may get some ideas on what you are looking for there.
here is a decent write up for Intel® Smart Cache "Software Techniques for Shared-Cache Multi-Core Systems": http://archive.is/hm0y
here are some stackoverflow questions/answers that may help you find the information you are looking for:
Storing C/C++ variables in processor cache instead of system memory
C++: Working with the CPU cache
Understanding how the CPU decides what gets loaded into cache memory
Sorry for bombarding you with links but this is the best I can do without a clearer understanding of what you are looking to accomplish.
I suggest reading "Volatile: Almost Useless for Multi-Threaded Programming" for why volatile should be removed from the example code. Instead, use C11 or C++11 atomic operations. See also the Fenced Data Transfer example in the TBB Design Patterns Manual.
Below I show the parts of the question example that I changed to use C++11 atomics. I compiled it with g++ 4.7.2.
#include <atomic>
...
struct point {
    std::atomic<int> w;
    std::atomic<int> r;
};
struct point g_p;
void iwrite(int x) {
    int w = g_p.w.load(std::memory_order_relaxed);
    int r;
    while ((r = g_p.r.load(std::memory_order_acquire)) == w);
    sleep(0.1);
    g_queue[w] = x;
    w = (w + 1) % Q_LENGTH;
    g_p.w.store(w, std::memory_order_release);
    printf("#%d!%d", w, r);
}

void iread(int *x) {
    int r = g_p.r.load(std::memory_order_relaxed);
    int w;
    while (((r + 1) % Q_LENGTH) == (w = g_p.w.load(std::memory_order_acquire)));
    *x = g_queue[r];
    g_p.r.store((r + 1) % Q_LENGTH, std::memory_order_release);
    printf("-%d*%d", r, w);
}
The key changes are:
I removed "volatile" everywhere.
The members of struct point are declared as std::atomic
Some loads and stores of g_p.r and g_p.w are fenced. Others are hoisted.
When loading a variable modified by another thread, the code "snapshots" it into a local variable.
The code uses "relaxed load" (no fence) where a thread loads a variable that no other thread modifies. I hoisted those loads out of the spin loops since there is no point in repeating them.
The code uses "acquiring load" where a thread loads a "message is ready" indicator that is set by another thread, and uses a "releasing store" where it is storing a "message is ready" indicator" to be read by another thread. The release is necessary to ensure that the "message" (queue data) is written before the "ready" indicator (member of g_p) is written. The acquire is likewise necessary to ensure that the "message" is read after the "ready" indicator is seen.
The snapshots are used so that the printf reports the value that the thread actually used, as opposed to some new value that appeared later. In general I like to use the snapshot style for two reasons. First, touching shared memory can be expensive because it often requires cache-line transfers. Second, the style gives me a stable value to use locally without having to worry that a reread might return a different value.

Is there a way to force windows to cache a file?

Is there a batch command or something that will force Windows to cache a file? I am trying to create a game preloader that loads certain game files into the cache before starting the game. Is there any way I can do this?
Updated int main code:
int main(int argc, const char** argv)
{
    pf("C:\\Games\\World_of_Tanks\\res\\packages\\gui.pkg");
    return 0;
}
All you need to do is load the files, either using ReadFile or by memory mapping the files and touching every page (in fact, due to allocation granularity every 16th page suffices, but in theory you should be touching every page).
Memory mapping is faster and more cache-friendly, since you do not need to allocate extra memory to hold the data (which you aren't going to use for anything useful!). The OS will reuse the same physical memory for the cache and for the virtual memory that your process can see.
Several mainstream applications, including Microsoft Office and Adobe Reader, do exactly that to launch faster. It's those "delayed start" services that keep your hard disk light flashing for a dozen seconds after you log in.
Do note, however, that while you can force Windows¹ to cache files that way, you cannot force it to keep the files in the cache indefinitely. If there is not enough physical RAM available, the system will throw away cache contents in order to satisfy application demands.
EDIT: Minimum working example implementation using filemapping:
#include <windows.h>
#include <cstdio>

void pf(const char* name)
{
    HANDLE file = CreateFile(name, GENERIC_READ, FILE_SHARE_READ | FILE_SHARE_WRITE, 0, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
    if(file == INVALID_HANDLE_VALUE) { printf("couldn't open %s\n", name); return; }

    unsigned int len = GetFileSize(file, 0);

    HANDLE mapping = CreateFileMapping(file, 0, PAGE_READONLY, 0, 0, 0);
    if(mapping == 0) { printf("couldn't map %s\n", name); return; }

    const char* data = (const char*) MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);
    if(data)
    {
        printf("prefetching %s... ", name);

        // need volatile or need to use result - compiler will otherwise optimize out whole loop
        volatile unsigned int touch = 0;
        for(unsigned int i = 0; i < len; i += 4096)
            touch += data[i];
    }
    else
        printf("couldn't create view of %s\n", name);

    UnmapViewOfFile(data);
    CloseHandle(mapping);
    CloseHandle(file);
}

int main(int argc, const char** argv)
{
    if(argc >= 2) for(int i = 1; argv[i]; ++i) pf(argv[i]);
    return 0;
}
The program will try to prefetch any filename given on the commandline.
The code isn't overly pretty but it works. It uses ANSI filenames, and leaks a file handle in case opening succeeds but mapping fails (but bleh... it's not really a problem, the OS will clean up after the program exits -- if that annoys you, wrap the handles in RAII). It's also limited to ca. 1.8GiB file size due to address space in a 32-bit build, otherwise limited to 4GiB due to GetFileSize, but that's also trivial to fix if you really need that big a file.
Instead of volatile one might want to return or otherwise consume the "result", but either way works (volatile does not truly have a measurable impact on performance, compared to a disk access!).
¹ Truth be told, you actually can't force Windows, but it incidentally always works that way unless you explicitly request unbuffered I/O.
In theory, you could force the OS to read pages into memory and even force it to keep them in RAM by locking the memory, but your working set quota (which is very small, and you need administrative rights to modify it) will not normally let you do this. That's a good thing though, since locking large amounts of memory is a very bad idea.
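If you still want to experiment with that, a rough sketch (my own illustration; the helper name and the working-set sizes are arbitrary) would combine SetProcessWorkingSetSize with VirtualLock:
#include <windows.h>
#include <cstdio>

// Sketch only: try to pin a region in physical RAM.
// VirtualLock fails unless the process working set is large enough,
// which is why this first grows the working set (and may need admin rights).
bool lock_region(void* p, size_t len)
{
    if (!SetProcessWorkingSetSize(GetCurrentProcess(),
                                  len + (1 << 20),     // minimum working set
                                  len + (2 << 20)))    // maximum working set
        printf("could not grow working set (%lu)\n", GetLastError());

    return VirtualLock(p, len) != 0;
}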
