When the MPI_Send buffer size is 100 the program works, but it hangs when the size is 1000 or greater. Why?
if(id == 0){
    rgb_image = stbi_load(argv[1], &width, &height, &bpp, CHANNEL_NUM);
    for(int i = 0; i < size -1; i++)
        MPI_Send(rgb_image,1000,MPI_UINT8_T,i,0,MPI_COMM_WORLD);
}
uint8_t *part = (uint8_t*) malloc(sizeof(uint8_t)*(1000));
if(id != size-1 && size > 1)
    MPI_Recv(part,1000,MPI_UINT8_T,0,0,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
This program is not valid w.r.t. the MPI Standard, since no matching receive has been posted on rank 0 when it calls
MPI_Send(..., dest=0, ...)
MPI_Send() is allowed to block until a matching receive is posted (and that generally happens when the message is "large"), and here the required matching receive never gets posted because rank 0 is stuck in the send.
A typical fix would be to issue an MPI_Irecv(...,src = 0,...) on rank 0 before the MPI_Send() (and MPI_Wait() after), or to handle the 0 -> 0 communication with MPI_Sendrecv().
That being said, it would likely be more efficient to create a communicator with all the ranks except the last one, and MPI_Bcast() in this communicator.
If a program works for small buffers but not for large ones, you are probably running into "eager sends". Normally, a send & receive transaction involves the sender & receiver talking back and forth, confirming that the data went across. This is overhead, so for small messages, many MPI implementations will just send the data without confirmation. The data then goes into some internal buffer on the receiver.
But this means that your program can "succeed" even if it's not a correct program, as is the case here. See @Giles's answer.
I've been away from parallel programming for a long time and I am trying to figure out the best method for coordinating the sending of large amounts of data between many processors with a complicated dependency structure. For example, I might need to send data to/from the following processes:
int process_1_dependencies[] = {2,3,5,6}
int process_2_dependencies[] = {1}
int process_3_dependencies[] = {1,4,5}
int process_4_dependencies[] = {3,5,6}
int process_5_dependencies[] = {1,3,4,6}
int process_6_dependencies[] = {1,4,5,7}
int process_7_dependencies[] = {6,8}
int process_8_dependencies[] = {7}
The obvious, and stupid, way of doing this would be to do something like:
for(int i = 0; i < world_size; i++)
{
    for(int j = 0; j < dependency_length; j++)
    {
        if (i == my_rank)
        {
            mpi_irecv(...,source=dependency[j],)
        }
        else
        {
            if (i == dependency[j])
            {
                mpi_isend(...,dest=dependency[j])
            }
        }
    }
    // blocking stuff?
}
I'm not actually sure this would work once you have hundreds of communications going, and in any case it seems super inefficient. It's at least O(N) and only allows a single process to be receiving at once. A better way would be to use blocking calls and ensure that independent processes are simultaneously exchanging information. But that becomes quite complicated and requires optimizing which processes are simultaneously sending and receiving.
Am I just completely overthinking this? Is it safe to do something like this (provided that every sending process has a receiving pair):
for(int i = 0; i < dependency_length; i++)
{
    mpi_isend(..., dest=dependency[i], ...)
    mpi_irecv(..., source=dependency[i], ...)
}
//blocking stuff
Sorry for the lack of focus in the question. I'm away from my computer so I can't really test it out, and even if it worked I'm not confident that it is scalable or that the buffers would keep working for arbitrary numbers of processes.
To avoid queueing a large number of messages and to avoid opaque deadlock problems, you can also employ a single call to MPI_Alltoallv, where all sends and receives are done for you automatically, and (with crossed fingers) even hope that your MPI implementation is able to optimize all communication on its own. The prototype is
MPI_Alltoallv
(
sendbuf, // buffer containing all data needed by other ranks in comm
sendcounts, // number of elements to send to each rank in comm
sdispls, // offsets in sendbuf per rank in comm
sendtype, // MPI datatype of the sent data
recvbuf, // buffer to contain all data needed by this rank
recvcounts, // number of elements to receive per rank in comm
rdispls, // offsets in recvbuf per rank in comm
recvtype, // MPI datatype of the received data
comm // the communicator
);
where sendcounts would be directly related to your process_X_dependencies; it would contain non-zero values at positions listed by process_X_dependencies.
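As a rough illustration of how sendcounts and sdispls could be derived from a dependency list, here is a minimal C++ sketch. The helper make_layout and the struct AlltoallvLayout are hypothetical names (not part of MPI), and the sketch assumes 0-based ranks and the same number of elements for every peer; the 1-based process numbers in the question would need to be shifted by one.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper type (not part of MPI): the per-rank counts and
// offsets that MPI_Alltoallv expects.
struct AlltoallvLayout {
    std::vector<int> counts;  // counts[r] = elements exchanged with rank r
    std::vector<int> displs;  // displs[r] = offset of rank r's block
};

// Assumption: ranks are 0-based and every listed peer exchanges
// `elems_per_peer` elements; all other ranks get a count of zero.
AlltoallvLayout make_layout(int world_size,
                            const std::vector<int>& dependencies,
                            int elems_per_peer) {
    AlltoallvLayout layout;
    layout.counts.assign(world_size, 0);
    layout.displs.assign(world_size, 0);
    for (int peer : dependencies) {
        assert(peer >= 0 && peer < world_size);
        layout.counts[peer] = elems_per_peer;
    }
    // Exclusive prefix sum: each block starts where the previous one ends.
    int offset = 0;
    for (int r = 0; r < world_size; ++r) {
        layout.displs[r] = offset;
        offset += layout.counts[r];
    }
    return layout;
}
```

With this layout, layout.counts.data() and layout.displs.data() would be passed as sendcounts and sdispls. If the exchange is symmetric, the same arrays could plausibly serve as recvcounts and rdispls as well.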
So I've got N asynchronous, timestamped data streams. Each stream has a fixed-ish rate. I want to process all of the data, but the catch is that I must process the data in order as close to the time that the data arrived as possible (it is a real-time streaming application).
So far, my implementation has been to create a fixed window of K messages which I sort by timestamp using a priority queue. I then process the entirety of this queue in order before moving on to the next window. This is okay, but it's less than ideal because it creates lag proportional to the size of the buffer, and it will also sometimes lead to dropped messages if a message arrives just after the end of the buffer has been processed. It looks something like this:
// Priority queue keeping track of the data in timestamp order.
ThreadSafePriorityQueue<Data> q;
// Fixed buffer size
int K = 10;
// The last successfully processed data timestamp
time_t lastTimestamp = -1;
// Called for each of the N data streams asynchronously
void receiveAsyncData(const Data& dat) {
    q.push(dat.timestamp, dat);
    if (q.size() > K) {
        processQueue();
    }
}
// Process all the data in the queue.
void processQueue() {
    while (!q.empty()) {
        const auto& data = q.top();
        // If the data is too old, drop it.
        if (data.timestamp < lastTimestamp) {
            LOG("Dropping message. Too old.");
            q.pop();
            continue;
        }
        // Otherwise, process it.
        processData(data);
        lastTimestamp = data.timestamp;
        q.pop();
    }
}
Information about the data: they're guaranteed to be sorted within their own stream. Their rates are between 5 and 30 Hz. They consist of images and other bits of data.
Some examples of why this is harder than it appears. Suppose I have two streams, A and B both running at 1 Hz and I get the data in the following order:
(stream, time)
(A, 2)
(B, 1.5)
(A, 3)
(B, 2.5)
(A, 4)
(B, 3.5)
(A, 5)
See how, if I processed the data in the order in which I received them, B would always get dropped? That's what I want to avoid. With my current algorithm, B would get dropped every 10th frame, and I would process the data with a lag of 10 frames into the past.
I would suggest a producer/consumer structure. Have each stream put data into the queue, and a separate thread reading the queue. That is:
// your asynchronous update:
void receiveAsyncData(const Data& dat) {
    q.push(dat.timestamp, dat);
}
// separate thread that processes the queue
void processQueue()
{
    while (!stopRequested)
    {
        data = q.pop();
        if (data.timestamp >= lastTimestamp)
        {
            processData(data);
            lastTimestamp = data.timestamp;
        }
    }
}
This prevents the "lag" that you see in your current implementation when you're processing a batch.
The processQueue function runs in a separate, persistent thread. stopRequested is a flag that the program sets when it wants to shut down, forcing the thread to exit. Some people would use a volatile flag for this. I prefer to use something like a manual reset event.
To make this work, you'll need a priority queue implementation that allows concurrent updates, or you'll need to wrap your queue with a synchronization lock. In particular, you want to make sure that q.pop() waits for the next item when the queue is empty. Or that you never call q.pop() when the queue is empty. I don't know the specifics of your ThreadSafePriorityQueue, so I can't really say exactly how you'd write that.
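For illustration, here is a minimal C++ sketch of a queue whose pop() waits while the queue is empty. BlockingPriorityQueue is a hypothetical stand-in, since the real ThreadSafePriorityQueue is not shown; it assumes a numeric timestamp and a copyable payload type, and it omits the shutdown handling a real program would need.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical stand-in for the question's ThreadSafePriorityQueue:
// a min-heap on the timestamp whose pop() blocks while the queue is empty.
template <typename T>
class BlockingPriorityQueue {
public:
    void push(double timestamp, T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            heap_.push(Entry{timestamp, std::move(value)});
        }
        not_empty_.notify_one();  // wake one waiting consumer, if any
    }

    // Waits until an item is available, then returns the oldest one.
    std::pair<double, T> pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !heap_.empty(); });
        assert(!heap_.empty());
        Entry e = heap_.top();
        heap_.pop();
        return {e.timestamp, std::move(e.value)};
    }

private:
    struct Entry {
        double timestamp;
        T value;
    };
    // Orders the heap so that the smallest timestamp is on top.
    struct Later {
        bool operator()(const Entry& a, const Entry& b) const {
            return a.timestamp > b.timestamp;
        }
    };
    std::mutex mutex_;
    std::condition_variable not_empty_;
    std::priority_queue<Entry, std::vector<Entry>, Later> heap_;
};
```

The receiving callbacks would call push() and the processing thread would loop on pop(). Because pop() sleeps on the condition variable, the processing thread does not spin while the streams are idle.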
The timestamp check is still necessary because it's possible for a later item to be processed before an earlier item. For example:
1. Event received from data stream 1, but thread is swapped out before it can be added to the queue.
2. Event received from data stream 2, and is added to the queue.
3. Event from data stream 2 is removed from the queue by the processQueue function.
4. Thread from step 1 above gets another time slice and item is added to the queue.
This isn't unusual, just infrequent. And the time difference will typically be on the order of microseconds.
If you regularly get updates out of order, then you can introduce an artificial delay. For example, in your updated question you show messages coming in out of order by 500 milliseconds. Let's assume that 500 milliseconds is the maximum tolerance you want to support. That is, if a message comes in more than 500 ms late, then it will get dropped.
What you do is add 500 ms to the timestamp when you add the thing to the priority queue. That is:
q.push(AddMs(dat.timestamp, 500), dat);
And in the loop that processes things, you don't dequeue something before its timestamp. Something like:
while (true)
{
    if (q.peek().timestamp <= currentTime)
    {
        data = q.pop();
        if (data.timestamp >= lastTimestamp)
        {
            processData(data);
            lastTimestamp = data.timestamp;
        }
    }
}
This introduces a 500 ms delay in the processing of all items, but it prevents dropping "late" updates that fall within the 500 ms threshold. You have to balance your desire for "real time" updates with your desire to prevent dropping updates.
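A single-threaded C++ sketch of this hold-back idea is shown below. The names Msg and HoldBackBuffer are hypothetical, timestamps are assumed to be non-negative milliseconds, and the current time is passed in explicitly so the behavior is easy to follow. Instead of adding the delay to the key at push time, the sketch adds it when checking eligibility, which amounts to the same ordering.

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Hypothetical message type; timestamps are in milliseconds.
struct Msg {
    long timestamp_ms;
    char stream;
};

// Hold-back buffer: a message becomes eligible `delay_ms` after its timestamp.
class HoldBackBuffer {
public:
    explicit HoldBackBuffer(long delay_ms) : delay_ms_(delay_ms) {
        assert(delay_ms >= 0);
    }

    void push(const Msg& m) { heap_.push(m); }

    // Returns, in timestamp order, every message whose hold-back period has
    // expired at `now_ms`. Messages older than the last released one are dropped.
    std::vector<Msg> release(long now_ms) {
        std::vector<Msg> out;
        while (!heap_.empty() &&
               heap_.top().timestamp_ms + delay_ms_ <= now_ms) {
            Msg m = heap_.top();
            heap_.pop();
            if (m.timestamp_ms >= last_released_ms_) {
                out.push_back(m);
                last_released_ms_ = m.timestamp_ms;
            }
        }
        return out;
    }

private:
    // Orders the heap so that the smallest timestamp is on top.
    struct Later {
        bool operator()(const Msg& a, const Msg& b) const {
            return a.timestamp_ms > b.timestamp_ms;
        }
    };
    long delay_ms_;
    long last_released_ms_ = -1;
    std::priority_queue<Msg, std::vector<Msg>, Later> heap_;
};
```

With a 500 ms delay and the arrival pattern from the question, a B sample that arrives half a second after a newer A sample is still released first, because the A sample is not yet eligible.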
There will always be a lag, and that lag will be determined by how long you are willing to wait for your slowest "fixed-ish rate" stream.
Suggestion:
1. Keep the buffer.
2. Keep an array of bool flags with the meaning: "if position ix is true, the buffer holds at least one sample that originated from stream ix".
3. Sort/process as soon as all flags are true.
Not foolproof (each buffer will be sorted, but from one buffer to another you may have timestamp inversion), but perhaps good enough?
Playing around with the count of "satisfied" flags to trigger the processing (at step 3) may be used to make the lag smaller, but with the risk of more inter-buffer timestamp inversions. In extreme, accepting the processing with only one satisfied flag means "push a frame as soon as you receive it, timestamp sorting be damned".
I mention this to support my feeling that the balance between lag and timestamp inversions is inherent to your problem: except for absolutely equal frame rates, there is no perfect solution in which neither side is sacrificed.
Since a "solution" will be an act of balancing, any solution will require gathering/using extra information to help decisions (e.g. that "array of flags"). If what I suggested sounds silly for your case (it may well be; you haven't shared many details), start thinking about which metrics are relevant for your targeted level of "quality of experience" and use additional data structures to help gather/process/use those metrics.
Let's say I want to fragment some data units into packets (max size per packet is, say, 1024 bytes). Each data unit can be of variable size, say:
a = 20 bytes
b = 1000 bytes
c = 10 bytes
d = 800 bytes
Can anyone please suggest any efficient algorithm to create packets with such random data efficiently utilizing the bandwidth? I cannot split the individual data units into bytes...they go whole inside a packet.
EDIT: The ordering of data units is of no concern!
There are several different ways, depending on your requirements and how much time you want to spend on it. The general problem, as @amit mentioned in comments, is NP-Hard. But you can get some improvement with some simple changes.
Before we go there, are you sure you really need to do this? Most networking layers have a packet-sized (or larger) buffer. When you write to the network, it puts your data in that buffer. If you don't fill the buffer completely, the code will delay briefly before sending. If you add more data during that delay, the new data is added to the buffer. The buffer is sent once it fills, or after the delay timeout expires.
So if you have a loop that writes one byte at a time to the network, it's not like you'll be creating a large number of one-byte packets.
On the receiving side, the lowest level networking layer receives an entire packet, but there's no guarantee that your call to receive the data will get the entire packet. That is, the sender might send an 800 byte packet, but on the receiving end the first call to read might only return 50 or 273 bytes.
This depends, of course, on the level at which you're reading the data. If you're talking about something like Java or .NET, where your interface to the network stack is through a socket, you almost certainly can't guarantee that a call to socket.Read() will return an entire packet.
Now, if you can guarantee that every call to read returns an entire packet, then the easiest way to pack things would be to serialize everything into one big buffer and then send it out in multiple 1,024-byte packets. You'll want to create a header at the front of the first packet that says how many total bytes will be sent, so the receiver knows what to expect. The result will be a bunch of 1,024-byte packets, potentially followed by a final packet that is somewhat smaller.
If you want to make sure that a data object is fully contained within a single packet, then you have to do something like:
add a to buffer
if remaining buffer < size of b
    send buffer
    clear buffer
add b to buffer
if remaining buffer < size of c
    send buffer
    clear buffer
add c to buffer
... etc ...
Here's some simple JavaScript pseudo code. The packets will stay ordered, and each packet is filled as far as that order allows.
packets = [];
PACKET_SIZE = 1024;
currentPacket = [];
function write(data) {
    var len = currentPacket.length + data.length;
    if(len < PACKET_SIZE) {
        currentPacket = currentPacket.concat(data);
    } else if(len === PACKET_SIZE) {
        packets.push(currentPacket.concat(data));
        currentPacket = [];
    } else { // if(len > PACKET_SIZE) {
        packets.push(currentPacket);
        currentPacket = data;
    }
}
function flush() {
    if(currentPacket.length > 0) {
        packets.push(currentPacket);
        currentPacket = [];
    }
}
write(data20bytes);
write(data1000bytes);
write(data10bytes);
write(data800bytes);
flush();
EDIT: Since you have all of the data chunks and you want to package them optimally out of order (bin packing), you are left with either trying every permutation of the chunks for an exact answer or compromising with a best-guess heuristic such as first fit.
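As one possible heuristic of that kind, here is a minimal C++ sketch of first-fit decreasing. The function name pack_first_fit_decreasing is made up for this example, and data units are represented only by their sizes in bytes; it is a heuristic and does not guarantee the minimum number of packets.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// First-fit decreasing: sort units largest first, then put each unit into
// the first packet that still has room. A heuristic, not an exact solver.
std::vector<std::vector<int>> pack_first_fit_decreasing(std::vector<int> sizes,
                                                        int packet_size) {
    std::sort(sizes.begin(), sizes.end(), std::greater<int>());
    std::vector<std::vector<int>> packets;
    std::vector<int> free_space;  // free_space[i] = bytes left in packets[i]
    for (int size : sizes) {
        assert(size > 0 && size <= packet_size);  // units cannot be split
        // Find the first packet with enough room for this unit.
        std::size_t i = 0;
        while (i < packets.size() && free_space[i] < size) ++i;
        if (i == packets.size()) {  // no packet fits: open a new one
            packets.emplace_back();
            free_space.push_back(packet_size);
        }
        packets[i].push_back(size);
        free_space[i] -= size;
    }
    return packets;
}
```

For the sizes in the question (20, 1000, 10, 800) and a 1024-byte limit, this produces two packets: one holding the 1000- and 20-byte units and one holding the 800- and 10-byte units.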
I'm trying to solve exercise 5.10 of the book
"Foundations of Multithreaded, Parallel, and Distributed Programming".
The exercise is
"Assume one producer process and N consumer processes share a bounded buffer having B slots. The producer deposits messages in the buffer; consumers fetch them. Every message deposited by the producer is to be received by all N consumers. Furthermore, each consumer is to receive the messages in the order they were deposited. However, consumers can receive messages at different times. For example, one consumer could receive up to B more messages than another if the second consumer is slow.
Develop a monitor that implements this kind of communication. Use Signal and Continue discipline."
Can someone help me, please?
Thank you very much!
--
EDIT:
Here is what I have done so far (I left it out at first because I thought it would make the question too long).
/* creating a buffer of B positions. */
global buffer[B];
Monitor {
    cond ok_write;
    cond ok_read;
    int stamp_buffer[B] = [0, 0, .., 0]
    request_write (int pos){
        if (stamp_buffer[pos] > 0)
            wait(ok_write);
        write_message (buffer[pos]);
        stamp_buffer[pos] = N;
        signalAll (ok_read);
    }
    request_read (int pos){
        if (stamp_buffer[pos] == 0)
            wait (ok_read);
        stamp_buffer[pos] --;
    }
    release_read (int pos){
        if (stamp_buffer[pos]==0)
            signal(ok_write);
    }
}
So, I think that I still have this problem: "A reader can read the same message two times."
The basic idea of my algorithm is:
The writer writes in a position "pos" and sets the value of stamp_buffer[pos] to N.
Then, each reader that reads position pos decrements stamp_buffer[pos] by 1.
So, if stamp_buffer[pos] is zero, the message buffer[pos] has already been read N times and the writer can write in this position again.
But, if some reader reads a message two times (or more), the writer can write a new message in position pos before some other reader has read the old message.
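One way to rule out double reads is to give every consumer its own read cursor, so a consumer can only ever move forward through the message sequence. The following C++ sketch illustrates that idea under stated assumptions: the class name BroadcastBuffer and its members are hypothetical, messages are plain ints, and std::condition_variable is used because its wake-up semantics correspond to Signal and Continue. It is a sketch of one possible design, not the book's solution.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <vector>

// Sketch of a broadcast bounded buffer: one producer, N consumers, B slots.
// Each consumer keeps its own cursor, so it cannot read a message twice.
class BroadcastBuffer {
public:
    BroadcastBuffer(std::size_t slots, std::size_t consumers)
        : slots_(slots), pending_(slots, 0), next_read_(consumers, 0),
          consumers_(consumers) {
        assert(slots > 0 && consumers > 0);
    }

    // Producer: blocks while the target slot still has unread copies.
    void deposit(int message) {
        std::unique_lock<std::mutex> lock(mutex_);
        const std::size_t pos = deposited_ % slots_.size();
        // "while", not "if": under Signal and Continue the condition may be
        // false again by the time the woken thread re-enters the monitor.
        while (pending_[pos] > 0) ok_write_.wait(lock);
        slots_[pos] = message;
        pending_[pos] = consumers_;
        ++deposited_;
        ok_read_.notify_all();
    }

    // Consumer `id`: blocks until a message it has not yet seen is available.
    int fetch(std::size_t id) {
        std::unique_lock<std::mutex> lock(mutex_);
        while (next_read_[id] == deposited_) ok_read_.wait(lock);
        const std::size_t pos = next_read_[id] % slots_.size();
        const int message = slots_[pos];
        ++next_read_[id];
        if (--pending_[pos] == 0) ok_write_.notify_one();
        return message;
    }

private:
    std::mutex mutex_;
    std::condition_variable ok_write_, ok_read_;
    std::vector<int> slots_;              // message stored in each slot
    std::vector<std::size_t> pending_;    // consumers still to read each slot
    std::vector<std::size_t> next_read_;  // per-consumer count of messages read
    std::size_t consumers_;
    std::size_t deposited_ = 0;           // total messages deposited so far
};
```

A slot is overwritten only after its pending count reaches zero, and each consumer decrements that count exactly once per message, so no consumer can skip or repeat a message. A fast consumer can get at most B messages ahead of the slowest one.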
At the moment I'm filling a std::vector with all of my data and then sending it with async_write. All of the packets I send have a 2-byte header, and this tells the receiver how much further to read (if any further at all). The code which generates this std::vector is:
std::vector<boost::asio::const_buffer> BasePacket::buffer()
{
    std::vector<boost::asio::const_buffer> buffers;
    buffers.push_back(boost::asio::buffer(headerBytes_)); // This is just a boost::array<uint8_t, 2>
    return buffers;
}
std::vector<boost::asio::const_buffer> UpdatePacket::buffer()
{
    printf("Making an update packet into a buffer.\n");
    std::vector<boost::asio::const_buffer> buffers = BasePacket::buffer();
    boost::array<uint16_t, 2> test = { 30, 40 };
    buffers.push_back(boost::asio::buffer(test));
    return buffers;
}
This is read by:
void readHeader(const boost::system::error_code& error, size_t bytesTransferred)
{
    if(error)
    {
        printf("Error reading header: %s\n", error.message().c_str());
        return;
    }
    // At this point 2 bytes have been read into boost::array<uint8_t, 2> header
    uint8_t primeByte = header.data()[0];
    uint8_t supByte = header.data()[1];
    switch(primeByte)
    {
        // Unrelated case removed
        case PACKETHEADER::UPDATE:
            // Read the first 4 bytes as two 16-bit numbers representing the size of
            // the update
            boost::array<uint16_t, 2> buf;
            printf("Attempting to read the first two Uint16's.\n");
            boost::asio::read(mySocket, boost::asio::buffer(buf));
            printf("The update has size %d x %d\n", buf.data()[0], buf.data()[1]);
            break;
    }
    // Keep listening
    boost::asio::async_read(mySocket, boost::asio::buffer(header),
        boost::bind(readHeader, boost::asio::placeholders::error, boost::asio::placeholders::bytes_transferred));
}
The code compiles, however it doesn't print 30 x 40 as I would expect. Instead it prints 188 x 40. If I stretch the second array out, only the first byte is messed up. However, if I add a third array before sending (but still read the same amount), the values of the second array all get messed up. I'm guessing that this could be related to how I'm reading it (in chunks into one buffer rather than similar to how I'm writing it).
Ideally I'd like to avoid having to cast everything into bytes and read/write that way, since it's less clear and probably less portable, but I know that's an option. However, if there is a better way I'm fine rewriting what I have.
The first problem I see is a lifetime issue with the data you are sending. asio::buffers simply wrap a data buffer that you continue to own.
The UpdatePacket::buffer() method creates a boost::array which it wraps and then pushes back on the buffers std::vector. When the method exits the boost::array goes out of scope and the asio::buffer is now pointing to garbage.
There may be other issues, but this is a good start. Mind the lifetimes of your data buffers in Asio.
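To make the lifetime point concrete without pulling in Boost, here is a small C++ sketch. BufferView is a hypothetical stand-in for boost::asio::const_buffer (a non-owning pointer plus size), and the header bytes are placeholders. The only point is that the payload is stored as a member rather than as a local variable inside buffer().

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for boost::asio::const_buffer: a non-owning view.
struct BufferView {
    const void* data;
    std::size_t size;
};

// Sketch of the fix: the payload is a member, so it lives as long as the
// packet object and the views returned by buffer() stay valid.
class UpdatePacket {
public:
    UpdatePacket(std::uint16_t width, std::uint16_t height)
        : size_{{width, height}} {}

    std::vector<BufferView> buffer() const {
        std::vector<BufferView> views;
        views.push_back(BufferView{header_.data(), header_.size()});
        views.push_back(
            BufferView{size_.data(), size_.size() * sizeof(std::uint16_t)});
        return views;
    }

private:
    std::array<std::uint8_t, 2> header_{{1, 0}};  // placeholder header bytes
    std::array<std::uint16_t, 2> size_;           // owned payload
};
```

With real Asio the same rule applies one level up: the packet object itself has to stay alive until the async_write completion handler has been called, for example by keeping it in a member or a shared_ptr captured by the handler.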