MPI_Allreduce with Fortran and 2-byte integers

I'm trying to do an MPI sum over an array of 2-byte integers:
INTEGER, PARAMETER :: SIK2 = SELECTED_INT_KIND(2)
INTEGER(SIK2) :: s_save(dim)
It is an array holding integer values from 1 to 48 at most, so 2 bytes per element is enough and saves memory.
Therefore I tried the following:
CALL MPI_TYPE_CREATE_F90_INTEGER(SIK2, int2type, ierr)
CALL MPI_ALLreduce(MPI_IN_PLACE, s_save, nkpt_in, int2type, MPI_SUM, world_comm, ierr)
This works well with gfortran + Open MPI.
However, with Intel I get a crash:
MPI_Allreduce(1000)......: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x55d2160, count=987, dtype=USER<f90_integer>, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_SUM_check_dtype(106): MPI_Op MPI_SUM operation not defined for this datatype
Is there a proper (or recommended) way to do this so that it works for most compilers?

Related

Golang readers: Why writing int64 numbers using bitwise operator <<

I have come across the following code when dealing with Go readers to limit the number of bytes read from a remote client when sending a file through multipart upload (e.g. in Postman).
r.Body = http.MaxBytesReader(w, r.Body, 32<<20+1024)
If I am not mistaken, the above notation should represent 33555456 bytes, or 33.555456 MB (32 * 2 ^ 20) + 1024. Or is this number not correct?
What I don't understand is:
Why did the author write it like this? Why use 20 and not some other number?
Why did the author add the +1024 at all? Why didn't he write 33 MB instead?
would it be OK to write 33555456 directly as int64?
If I am not mistaken, the above notation should represent 33555456 bytes, or 33.555456 MB (32 * 2 ^ 20) + 1024. Or is this number not correct?
Correct. You can trivially check it yourself.
fmt.Println(32<<20+1024)
Why didn't he write 33 MB instead?
Because this number is not 33 MB. 33 * 1024 * 1024 = 34603008
would it be OK to write 33555456 directly as int64?
Naturally. That's likely what it is reduced to during compilation anyway. This notation is arguably easier to read, once you figure out the logic behind 32, 20 and 1024.
Ease of reading is why I almost always (when not using Ruby) write constants like "50 MB" as 50 * 1024 * 1024 and "30 days" as 30 * 86400, etc.
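As a small sketch of the arithmetic discussed above: in Go, << binds tighter than +, so 32<<20+1024 parses as (32<<20)+1024, and 1<<20 is 2^20 = 1048576 bytes (one MiB). The constant and function names below (KiB, MiB, maxBody, maxBodyBytes) are made up for illustration and do not come from the code in the question.

```go
package main

import "fmt"

// Hypothetical names, chosen only for this illustration.
const (
	KiB     = 1 << 10        // 1024
	MiB     = 1 << 20        // 1048576
	maxBody = 32<<20 + 1024 // parsed as (32<<20) + 1024
)

// maxBodyBytes returns the limit as an int64, the type MaxBytesReader expects.
func maxBodyBytes() int64 { return maxBody }

func main() {
	fmt.Println(maxBodyBytes())                // 33555456
	fmt.Println(maxBodyBytes() == 32*MiB+KiB)  // true
	fmt.Println(33 * MiB)                      // 34603008, which is why "33 MB" would be a different limit
}
```

Written this way, the limit reads as "32 MiB plus 1 KiB", which is what the shift notation encodes.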

How to maximize data transfer speed over USB (configured as virtual com port)

I'm having trouble getting my streaming to work over OTG-USB-FS configured as a VCP. I have a nucleo-h743zi board that seems to be doing a good job of sending me data, but on the PC side I have a problem receiving that data.
for(;;) {
    #define number_of_ccr 1024
    unsigned int lpBuffer[number_of_ccr] = {0};
    unsigned long nNumberOfBytesToRead = number_of_ccr*4;
    unsigned long lpNumberOfBytesRead;
    QueryPerformanceCounter(&startCounter);
    ReadFile(
        hSerial,
        lpBuffer,
        nNumberOfBytesToRead,
        &lpNumberOfBytesRead,
        NULL
    );
    if(!strcmp(lpBuffer, "end\r\n")) {
        CloseHandle(FileHandle);
        fprintf(stderr, "end flag was received\n");
        break;
    }
    else if(lpNumberOfBytesRead > 0) {
        // NOTE(): succeed
        QueryPerformanceCounter(&endCounter);
        time = Win32GetSecondsElapsed(startCounter, endCounter);
        char *copyString = "copy";
        WriteFile(hSerial, copyString, strlen(copyString), &bytes_written, NULL);
        DWORD BytesWritten;
        // write data to file
        WriteFile(FileHandle, lpBuffer, nNumberOfBytesToRead, &BytesWritten, 0);
    }
}
QPC shows 0.00733297970. That is the elapsed time for one successful data block transfer (1024*4 bytes).
This is the listener code. I bet this is not how it should be done, so I'm here to seek advice. I was hoping that full streaming without control sequences ("copy") would be possible, but in that case I can't receive adjacent data: within one transfer block the data is fine, but two consecutive received blocks aren't adjacent.
Example:
block_1: 1 2 3 4 5 6
block_2: 13 14 15 16 17 18
Is there any way to speed up my receiving?
(I tried compiling with the O2 flag, without any success.)
You need to set up a buffer on the PC side that is 2 or 3 times the size of the buffer you transfer from your board, and use something like a double-buffer scheme for transferring the data: you transfer the first buffer while filling the second, then alternate.
It is also a good idea to enable the caches and place the buffers in fast memory (on the STM32H7 that is the domain 1 RAM).
But if your interface cannot deliver the speed you need, no trick will fix that. Except maybe one: if your controller is fast enough, you can apply lossless compression to the data and transfer it compressed. If you transmit low-entropy data, this could give you a solid boost in speed.

Understand `openssl speed`

I ran openssl speed rsa512 and it shows me how many sign and verify operations it can do per second. Unfortunately, the test does not say anything about the size of the message being signed. So I dug into the openssl sources and found the following line in speed.c:
ret = RSA_sign(NID_md5_sha1, buf, 36, buf2, rsa_num, rsa_key[testnum]);
Looking into the function in the rsa.h, I can see the following function declaration:
int RSA_sign(int type, const unsigned char *m, unsigned int m_length,
unsigned char *sigret, unsigned int *siglen, RSA *rsa);
I guess m is the message and m_length is the length of the message.
Am I right that the message size is 36 bytes in the RSA speed test?
The same goes for ECDSA, e.g., openssl speed ecdsap256. The speed.c uses the following line:
ret = ECDSA_sign(0, buf, 20, ecdsasig, ecdsasiglen, ecdsa[testnum]);
Am I right that the message size is 20 bytes in the ECDSA speed test?
My conclusion: it's not possible to compare them, since they sign different message lengths.
Asymmetric signatures, technically, don't sign messages. They sign hashes of messages.
Their rsa512 test does the RSA signature padding and transformation on an SSL "MD5 and SHA1" value (which is 16 + 20 = 36 bytes). So the number it produces is how many RSA pad-and-sign (and answer-copy) operations it can do per second; to get the rate for signing real messages you also have to account for the time it takes to hash the message.
Their ecdsap256 computation assumes that the digest is SHA-1 (20 bytes). Again, you would combine this number with the time it takes to hash a message.
Since both figures are measured on a fixed-size digest and both scale the same way with the cost of hashing the data, they are comparable.
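To illustrate why 36 and 20 are digest sizes rather than message sizes, here is a minimal sketch using Go's standard crypto/md5 and crypto/sha1 packages (it does not call OpenSSL). The helper name digestLens is made up for this example. Whatever the message length, the value handed to the signing step has the same size.

```go
package main

import (
	"crypto/md5"
	"crypto/sha1"
	"fmt"
)

// digestLens hashes msg and returns the length of the concatenated
// MD5+SHA-1 value and the length of the SHA-1 digest alone.
func digestLens(msg []byte) (combined int, sha1Only int) {
	m := md5.Sum(msg)  // 16-byte array
	s := sha1.Sum(msg) // 20-byte array
	both := append(m[:], s[:]...)
	return len(both), len(s)
}

func main() {
	// Message sizes chosen arbitrarily for the illustration.
	for _, size := range []int{0, 1, 1 << 20} {
		combined, sha := digestLens(make([]byte, size))
		fmt.Println(size, combined, sha) // combined is always 36, sha always 20
	}
}
```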

ARM linux kernel head-common.S

I was looking at head-common.S, at __mmap_switched:
.long init_thread_union + THREAD_START_SP # sp //for stack pointer
THREAD_START_SP is defined as THREAD_SIZE (8192) - 8 in "thread_info.h".
So the stack size is set to 8 KB (8192) minus 8 bytes.
Why minus 8 bytes?
I suspect it is because of DA (decrement after) addressing. Is that right?
8-byte alignment of the stack is a requirement of the APCS.
Section 5.2.1, "The Stack", of the APCS says:
The stack must also conform to the following constraint at a public interface:
SP mod 8 = 0. The stack must be double-word aligned.
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.subset.swdev.abi/index.html
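A quick check of the arithmetic, as a sketch: subtracting 8 from THREAD_SIZE keeps the offset a multiple of 8, whereas subtracting 4 would not. This assumes init_thread_union itself starts on an 8-byte boundary; the Go names below are illustrative only.

```go
package main

import "fmt"

const (
	threadSize    = 8192           // THREAD_SIZE from the question
	threadStartSP = threadSize - 8 // THREAD_START_SP
)

// aligned8 reports whether an offset satisfies "SP mod 8 = 0".
func aligned8(off int) bool { return off%8 == 0 }

func main() {
	fmt.Println(threadStartSP, aligned8(threadStartSP)) // 8184 true
	fmt.Println(threadSize-4, aligned8(threadSize-4))   // 8188 false
}
```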

Ruby IO#read max length for single read

How can I determine the max length IO#read can get in a single read on the current platform?
irb(main):301:0> File.size('C:/large.file') / 1024 / 1024
=> 2145
irb(main):302:0> s = IO.read 'C:/large.file'
IOError: file too big for single read
That message comes from io.c, remain_size. It is emitted when the (remaining) size of the file is greater than or equal to LONG_MAX. That value depends on the platform your Ruby has been compiled for.
At least in Ruby 1.8.7, the maximum value for Fixnums happens to be just half of that value (-1), so you could get the limit with:
2 * 2 ** (1..128).to_a.find { | i | (1 << i).kind_of? Bignum } - 1
You should rather not rely on that.

Resources