MPI_Scatter redundant parameters?

MPI_Scatter redundant parameters? - parallel-processing

My question is rather simple, the MPI_Scatter function definition is:
#include <mpi.h>
void MPI::Comm::Scatter(const void* sendbuf, int sendcount,
const MPI::Datatype& sendtype, void* recvbuf,
int recvcount, const MPI::Datatype& recvtype,
int root) const
Are 'sendcount' and 'sendtype' redundant?
In which case it can happen: sendcount!=recvcount?
Edit:
Maybe some clarification is needed about the question. I understand that maybe the reason is that, for the root the data is some 'struct X' and for the receivers is some 'struct Y' that somehow it also makes sense (it all fits 'Ok').
If that's the case... I don't get why is needed to say again that the total size of the expected data to receive, is the same of the sended data size. If it's just a matter of casting the view of the data, I'd only do the cast. In fact, the buffer is a (void *).

MPI allows for both datatypes on the sending and on the receiving end to be different as long as they are constructed from the same basic datatypes. Thare are many cases where this comes handy, e.g. scattering rows of a matrix from the root process into columns in the other processes. Sending and receiving rows is straightforward in C and C++ as the memory layout of the matrices is row-major. Sending and receiving columns requres that a special strided vector type is constructed first. Usually this type is constructed for a specified number of rows and columns and then one has to supply a count of 1 when receiving the data.
There are also many other cases when sendcount and recvcount might differ. Mind also that recvcount does not specify the size of the message to be received but rather the capacity of the receive buffer and that capacity may be way larger than the size of the message.

MPI_scatter() is for break the message in equal piezes and process each one in the child nodes and in your own. Knowing this:
Are 'sendcount' and 'sendtype' redundant?
-How can that happen?, if sendCount is the number of elements sent, and sendType is the type of those elements. Both contains different information.
And for the last question:
In which case it can happen: sendcount!=recvcount?.
-When you want to sort a sequence of numbers, you send blocks of size N and type=int to your nodes. You want the same but sorted.

Related

Determining whether a C struct is packed or not

I'm extracting C struct layout from and executable using gdb-python.
I manage to get all the fields, offsets, types & sizes.
Still, when trying to re-generate the struct's code, I do not have any indication for whether it was marked with GCC's attribute((__packed__)).
Is there any way to get this information from the executable? (preferably using gdb-python, but any other way will do too)

Is there any way to get this information from the executable?
No, but you should be able to deduce this with a simple heuristic:
if sizeof(struct foo) is greater than the sum of its member field sizes, the struct is not packed.
if sizeof(struct foo) is equal to the sum of its member field sizes, the struct is either packed, or its members are naturally aligned with no holes, and packing doesn't matter for it.

How does gc handle slice memory reclaim

var a = [...]int{1,2,3,4,5,6}
s1 := a[2:4:5]
Suppose s1 goes out of scope later than a. How does gc know to reclaim the memory of s1's underlying array a?
Consider the runtime representation of s1, spec
type SliceHeader struct {
Data uintptr
Len int
Cap int
}
The GC doesn't even know about the beginning of a.

Go uses mark-and-sweep collector as it's present implementation.
As per the algorithm, there will be one root object, and the rest is tree like structure, in case of multi-core machines gc runs along with the program on one core.
gc will traverse the tree and when something is not reachable it, considers it as free.
Go objects also have metadata for objects as stated in this post.
An excerpt:
We needed to have some information about the objects since we didn't have headers. Mark bits are kept on the side and used for marking as well as allocation. Each word has 2 bits associated with it to tell you if it was a scalar or a pointer inside that word. It also encoded whether there were more pointers in the object so we could stop scanning objects sooner than later.
The reason go's slices (slice header) were structures instead of pointer to structures is documented by russ cox in this page under slice section.
This is an excerpt:
Go originally represented a slice as a pointer to the structure(slice header) , but doing so meant that every slice operation allocated a new memory object. Even with a fast allocator, that creates a lot of unnecessary work for the garbage collector, and we found that, as was the case with strings, programs avoided slicing operations in favor of passing explicit indices. Removing the indirection and the allocation made slices cheap enough to avoid passing explicit indices in most cases.
The size(length) of an array is part of its type. The types [1]int and [2]int are distinct.
One thing to remember is go is value oriented language, instead of storing pointers, they store direct values.
[3]int, arrays are values in go, so if you pass an array, it copies the whole array.
[3]int this is a value (one as a whole).
When one does a[1] you are accessing part of the value.
SliceHeader Data field says consider this as base point of array, instead of a[0]
As far as my knowledge is considered:
When one requests for a[4],
a[0]+(sizeof(type)*4)
is calculated.
Now if you are accessing something in through slice s = a[2:4],
and if one requests for s[1], what one was requesting is,
a[2]+sizeof(type)*1

Performance of std::vector<Test> vs std::vector<Test*>

In an std::vector of a non POD data type, is there a difference between a vector of objects and a vector of (smart) pointers to objects? I mean a difference in the implementation of these data structures by the compiler.
E.g.:
class Test {
std::string s;
Test *other;
};
std::vector<Test> vt;
std::vector<Test*> vpt;
Could be there no performance difference between vt and vpt?
In other words: when I define a vector<Test>, internally will the compiler create a vector<Test*> anyway?

In other words: when I define a vector, internally will the compiler create a vector anyway?
No, this is not allowed by the C++ standard. The following code is legal C++:
vector<Test> vt;
Test t1; t1.s = "1"; t1.other = NULL;
Test t2; t2.s = "1"; t2.other = NULL;
vt.push_back(t1);
vt.push_back(t2);
Test* pt = &vt[0];
pt++;
Test q = *pt; // q now equal to Test(2)
In other words, a vector "decays" to an array (accessing it like a C array is legal), so the compiler effectively has to store the elements internally as an array, and may not just store pointers.
But beware that the array pointer is valid only as long as the vector is not reallocated (which normally only happens when the size grows beyond capacity).

In general, whatever the type being stored in the vector is, instances of that may be copied. This means that if you are storing a std::string, instances of std::string will be copied.
For example, when you push a Type into a vector, the Type instance is copied into a instance housed inside of the vector. The copying of a pointer will be cheap, but, as Konrad Rudolph pointed out in the comments, this should not be the only thing you consider.
For simple objects like your Test, copying is going to be so fast that it will not matter.
Additionally, with C++11, moving allows avoiding creating an extra copy if one is not necessary.
So in short: A pointer will be copied faster, but copying is not the only thing that matters. I would worry about maintainable, logical code first and performance when it becomes a problem (or the situation calls for it).
As for your question about an internal pointer vector, no, vectors are implemented as arrays that are periodically resized when necessary. You can find GNU's libc++ implementation of vector online.
The answer gets a lot more complicated at a lower than C++ level. Pointers will of course have to be involved since an entire program cannot fit into registers. I don't know enough about that low of level to elaborate more though.

Use cases of std::multimap

I don't quite get the purpose of this data structure. What's the difference between std::multimap<K, V> and std::map<K, std::vector<V>>. The same goes for std::multiset- it could just be std::map<K, int> where the int counts the number of occurrences of K. Am I missing something on the uses of these structures?

A counter-example seems to be in order.
Consider a PhoneEntry in an AdressList grouped by name.
int AdressListCompare(const PhoneEntry& p1, const PhoneEntry& p2){
return p1.name<p2.name;
}
multiset<PhoneEntry, AdressListCompare> adressList;
adressList.insert( PhoneEntry("Cpt.G", "123-456", "Cellular") );
adressList.insert( PhoneEntry("Cpt.G", "234-567", "Work") );
// Getting the entries
addressList.equal_range( PhoneENtry("Cpt.G") ); // All numbers
This would not be feasible with a set+count. Your Object+count approach seems to be faster if this behavior is not required. For instance the multiset::count() member states
"Complexity: logarithmic in size +
linear in count."

You could use make the substitutions that you suggest, and extract similar behavior. But the interfaces would be very different than when dealing with regular standard containers. A major design theme of these containers is that they share as much interface as possible, making them as interchangeable as possible so that the appropriate container can be chosen without having to change the code that uses it.
For instance, std::map<K, std::vector<V>> would have iterators that dereference to std::pair<K, std::vector<V>> instead of std::pair<K, V>. std::map<K, std::vector<V>>::Count() wouldn't return the correct result, failing to account for the duplicates in the vector. Of course you could change your code to do the extra steps needed to correct for this, but now you are interfacing with the container in a much different way. You can't later drop in unordered_map or some other map implementation to see it performs better.
In a broader sense, you are breaking the container abstraction by handling container implementation details in your code rather than having a container that handles it's own business.
It's entirely possible that your compiler's implementation of std::multimap is really just a wrapper around std::map<K, std::vector<V>>. Or it might not be. It could be more efficient and friendly to object pool allocation (which vectors are not).
Using std::map<K, int> instead of std::multiset is the same case. Count() would not return the expected value, iterators will not iterate over the duplicates, iterators will dereference to std::pair<k, int> instead of directly to `K.

A multimap or multiset allows you to have elements with duplicate keys.
ie a set is a non-ordered group of elements that are all unique in that {A,B,C} == {B,C,A}

Adaptive IO Optimization Problem

Here is an interesting optimization problem that I think about for some days now:
In a system I read data from a slow IO device. I don't know beforehand how much data I need. The exact length is only known once I have read an entire package (think of it as it has some kind of end-symbol). Reading more data than required is not a problem except that it wastes time in IO.
Two constrains also come into play: Reads are very slow. Each byte I read costs. Also each read-request has a constant setup cost regardless of the number of bytes I read. This makes reading byte by byte costly. As a rule of thumb: the setup costs are roughly as expensive as a read of 5 bytes.
The packages I read are usually between 9 and 64 bytes, but there are rare occurrences larger or smaller packages. The entire range will be between 1 to 120 bytes.
Of course I know a little bit of my data: Packages come in sequences of identical sizes. I can classify three patterns here:
Sequences of reads with identical sizes:
A A A A A ...
Alternating sequences:
A B A B A B A B ...
And sequences of triples:
A B C A B C A B C ...
The special case of degenerated triples exist as well:
A A B A A B A A B ...
(A, B and C denote some package size between 1 and 120 here).
Question:
Based on the size of the previous packages, how do I predict the size of the next read request? I need something that adapts fast, uses little storage (lets say below 500 bytes) and is fast from a computational point of view as well.
Oh - and pre-generating some tables won't work because the statistic of read sizes can vary a lot with different devices I read from.
Any ideas?

You need to read at least 3 packages and at most 4 packages to identify the pattern.
Read 3 packages. If they are all same size, then the pattern is AAAAAA...
If they are all not the same size, read the 4th package. If 1=3 & 2=4, pattern is ABAB. Otherwise, pattern is ABCABC...
With that outline, it is probably a good idea to do a speculative read of 3 package sizes (something like 3*64 bytes at a single go).

I don't see a problem here.. But first, several questions:
1) Can you read the input asyncronously (e.g. separate thread, interrupt routine, etc)?
2) Do you have some free memory for a buffer?
3) If you've commanded a longer read, are you able to obtain first byte(s) before the whole packet is read?
If so (and I think in most cases it can be implemented), then you can just have a separate thread that reads them at highest possible speed and stores them in a buffer, with stalling when the buffer gets full, so that you normal process can use a synchronous getc() on that buffer.
EDIT: I see.. it's because of CRC or encryption? Well, then you could use some ideas from data compression:
Consider a simple adaptive algorithm of order N for M possible symbols:
int freqs[M][M][M]; // [a][b][c] : occurences of outcome "c" when prev vals were "a" and "b"
int prev[2]; // some history
int predict(){
int prediction = 0;
for (i = 1; i < M; i++)
if (freqs[prev[0]][prev[1]][i] > freqs[prev[0]][prev[1]][prediction])
prediction = i;
return prediction;
};
void add_outcome(int val){
if (freqs[prev[0]][prev[1]][val]++ > DECAY_LIMIT){
for (i = 0; i < M; i++)
freqs[prev[0]][prev[1]][i] >>= 1;
};
pred[0] = pred[1];
pred[1] = val;
};
freqs has to be an array of order N+1, and you have to remember N previsous values. N and DECAY_LIMIT have to be adjusted according to the statistics of the input. However, even they can be made adaptive (for example, if it producess too many misses, then the decay limit can be shortened).
The last problem would be the alphabet. Depending on the context, if there are several distinct sizes, you can create a one-to-one mapping to your symbols. If more, then you can use quantitization to limit the number of symbols. The whole algorithm can be written with pointer arithmetics, so that N and M won't be hardcoded.

Since reading is so slow, I suppose you can throw some CPU power at it so you can try to make an educated guess of how much to read.
That would be basically a predictor, that would have a model based on probabilities. It would generate a sample of predictions of the upcoming message size, and the cost of each. Then pick the message size that has the best expected cost.
Then when you find out the actual message size, use Bayes rule to update the model probabilities, and do it again.
Maybe this sounds complicated, but if the probabilities are stored as fixed-point fractions you won't have to deal with floating-point, so it may be not much code. I would use something like a Metropolis-Hastings algorithm as my basic simulator and bayesian update framework. (This is just an initial stab at thinking about it.)

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

MPI_Scatter redundant parameters? - parallel-processing

Related

Determining whether a C struct is packed or not

How does gc handle slice memory reclaim

Performance of std::vector<Test> vs std::vector<Test*>

Use cases of std::multimap

Adaptive IO Optimization Problem

Categories

Resources