Use MPI_Barrier() to improve performance and avoid buffer issues?

Can I do something like the following to improve performance and avoid the buffer problems I am running into at higher iteration counts? Here MaxIterations = 6000.
while (numberIterations <= MaxIterations)
{
    MPI_Iprobe()                                   -- check for incoming data
    while (flagprobe != 0)
    {
        MPI_Recv()                                 -- receive data
        MPI_Iprobe()                               -- loop while more data is pending
    }
    updateData()                                   -- update myData
    for (i = 0; i < N; i++) MPI_Isend(request[i])  -- post the sends
    for (i = 0; i < N; i++) MPI_Wait(request[i])   -- wait until each send completes
    if (numberIterations == MaxIterations) { MPI_Barrier() }
    numberIterations++
}
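For reference, here is a minimal compilable sketch of the same loop, under illustrative assumptions that are not in the question: each rank sends one double to each of N neighbour ranks stored in neighbours[], and the per-request MPI_Wait() loop is collapsed into a single MPI_Waitall():
#include <mpi.h>

#define N 4
#define MAX_ITERATIONS 6000

void iterate(const int neighbours[N], double *myData)
{
    MPI_Request request[N];
    MPI_Status status;
    double inbox;
    int flag, i, iter;

    for (iter = 1; iter <= MAX_ITERATIONS; iter++)
    {
        /* Drain every message that has already arrived. */
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
        while (flag)
        {
            MPI_Recv(&inbox, 1, MPI_DOUBLE, status.MPI_SOURCE, status.MPI_TAG,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
        }

        /* updateData() from the question would go here. */

        /* Post all sends, then complete them in one call. */
        for (i = 0; i < N; i++)
            MPI_Isend(myData, 1, MPI_DOUBLE, neighbours[i], 0,
                      MPI_COMM_WORLD, &request[i]);
        MPI_Waitall(N, request, MPI_STATUSES_IGNORE);

        if (iter == MAX_ITERATIONS)
            MPI_Barrier(MPI_COMM_WORLD);
    }
}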

Related

TIM2 module not ticking at 1us in STM8S103F3 controller

I created a program on the STM8S103F3 to generate delays in the range of microseconds using the TIM2 module, but the timer is not ticking as expected: when I call a 5 s delay with it, I get only about 3 s. I'm using the 16 MHz HSI oscillator and the timer prescaler is set to 16. Please see my code below and help me figure out what is wrong with it.
void clock_setup(void)
{
    CLK_DeInit();
    CLK_HSECmd(DISABLE);
    CLK_LSICmd(DISABLE);
    CLK_HSICmd(ENABLE);
    while (CLK_GetFlagStatus(CLK_FLAG_HSIRDY) == FALSE);
    CLK_ClockSwitchCmd(ENABLE);
    CLK_HSIPrescalerConfig(CLK_PRESCALER_HSIDIV1);
    CLK_SYSCLKConfig(CLK_PRESCALER_CPUDIV1);
    CLK_ClockSwitchConfig(CLK_SWITCHMODE_AUTO, CLK_SOURCE_HSI,
                          DISABLE, CLK_CURRENTCLOCKSTATE_ENABLE);
    CLK_PeripheralClockConfig(CLK_PERIPHERAL_SPI, DISABLE);
    CLK_PeripheralClockConfig(CLK_PERIPHERAL_I2C, DISABLE);
    CLK_PeripheralClockConfig(CLK_PERIPHERAL_ADC, DISABLE);
    CLK_PeripheralClockConfig(CLK_PERIPHERAL_AWU, DISABLE);
    CLK_PeripheralClockConfig(CLK_PERIPHERAL_UART1, DISABLE);
    CLK_PeripheralClockConfig(CLK_PERIPHERAL_TIMER1, DISABLE);
    CLK_PeripheralClockConfig(CLK_PERIPHERAL_TIMER2, DISABLE);
}
void delay_us(uint16_t us)
{
    volatile uint16_t temp;
    TIM2_DeInit();
    TIM2_TimeBaseInit(TIM2_PRESCALER_16, 2000); // Prescaler 16, timer clock 1MHz
    TIM2_Cmd(ENABLE);
    do {
        temp = TIM2_GetCounter();
    } while (temp < us);
    TIM2_ClearFlag(TIM2_FLAG_UPDATE);
    TIM2_Cmd(DISABLE);
}
void delay_ms(uint16_t ms)
{
    while (ms--)
    {
        delay_us(1000);
    }
}
It is better to use a 10 us time base and round the delays to it. To achieve a 10 us time base: if you use the 16 MHz master clock and prescale TIM2 by 16, you get a 1 us increment time, right? But we want TIM2 to overflow so it generates a 10 us update event. Since the timer increments every 1 us and counts up from 0 to the auto-reload value, an auto-reload value of 10 - 1 = 9 gives us an overflow every 10 us and hence a 10 us time base. If we are OK up to here, the delay code just has to check the TIM2 update flag to know whether it has overflowed or not. See the example code snippet below.
// Set this up once, since our time base is a fixed 10us
void setupTIM2(void)
{
    TIM2_DeInit();
    TIM2_TimeBaseInit(TIM2_PRESCALER_16, 9); // Prescaler 16: 1MHz tick, overflow every 10us
}
void delay_us(uint16_t us)
{
    volatile uint16_t temp = 0;
    const uint16_t count = us / 10;     // Number of 10us periods required
    TIM2_SetCounter(0);                 // Start counting from a known value
    TIM2_ClearFlag(TIM2_FLAG_UPDATE);   // Discard any stale overflow flag
    TIM2_Cmd(ENABLE);
    // Loop until temp reaches the required count value
    do {
        while (TIM2_GetFlagStatus(TIM2_FLAG_UPDATE) == RESET); // Wait for TIM2 to overflow
        TIM2_ClearFlag(TIM2_FLAG_UPDATE);                      // Clear the overflow flag
        temp++;
    } while (temp < count);
    TIM2_Cmd(DISABLE);
}
void delay_ms(uint16_t ms)
{
    while (ms--)
    {
        delay_us(1000);
    }
}
void main(void)
{
    ...
    setupTIM2();
    ...
    delay_ms(5000);
}

SPI implementation stuck on “while(!spi_is_tx_empty(WINC1500_SPI));”

I'm currently implementing a driver for the WINC1500 to be used with an ATmega32 MCU, and it gets stuck on the line while(!spi_is_tx_empty(WINC1500_SPI));. The code builds and runs, but the flag this function checks never clears, so execution never proceeds far enough to boot up the Wi-Fi module. I've been stuck on this problem for weeks now with no progress and don't know how to clear it.
static inline bool spi_is_tx_empty(volatile avr32_spi_t *spi)
{
    // 1 = All transmissions complete
    // 0 = Transmissions not complete
    return (spi->sr & AVR32_SPI_SR_TXEMPTY_MASK) != 0;
}
Here is my implementation of the SPI Tx/Rx function
void m2mStub_SpiTxRx(uint8_t *p_txBuf,
                     uint16_t txLen,
                     uint8_t *p_rxBuf,
                     uint16_t rxLen)
{
    uint16_t byteCount;
    uint16_t i;

    // Calculate the number of clock cycles necessary; this implies a full-duplex SPI.
    byteCount = (txLen >= rxLen) ? txLen : rxLen;

    // Read / transmit.
    for (i = 0; i < byteCount; ++i)
    {
        // Wait for transmitter to be ready.
        while (!spi_is_tx_ready(WINC1500_SPI));

        // Transmit.
        if (txLen > 0)
        {
            // Send data from the transmit buffer.
            spi_put(WINC1500_SPI, *p_txBuf++);
            --txLen;
        }
        else
        {
            // No more Tx data to send; clock out a don't-care byte to keep the clock active.
            spi_put(WINC1500_SPI, 0x00U);
            // Not reading it back, not being cleared 16/1/2020
        }

        // Reference http://asf.atmel.com/docs/latest/avr32.components.memory.sdmmc.spi.example.evk1101/html/avr32_drivers_spi_quick_start.html
        // Wait for transfer to finish -- stuck on here.
        // Need to clear the buffer for it to be able to continue.
        while (!spi_is_tx_empty(WINC1500_SPI));

        // Wait for transmitter to be ready again.
        while (!spi_is_tx_ready(WINC1500_SPI));

        // Send dummy data to slave, so we can read something from it.
        spi_put(WINC1500_SPI, 0x00U); // Idea: change dummy data from 0x00U to 0xFF

        // Wait for a complete transmission.
        while (!spi_is_tx_empty(WINC1500_SPI));

        // Read or throw away data from the slave as required.
        if (rxLen > 0)
        {
            *p_rxBuf++ = spi_get(WINC1500_SPI);
            --rxLen;
        }
        else
        {
            spi_get(WINC1500_SPI);
        }
    }
}
Debug output log
Disable SPI
Init SPI module as master
Configure SPI and Clock settings
spi_enable(WINC1500_SPI)
InitStateMachine()
INIT_START_STATE
InitStateMachine()
INIT_WAIT_FOR_CHIP_RESET_STATE
m2mStub_PinSet_CE
m2mStub_PinSet_RESET
m2mStub_GetOneMsTimer();
SetChipHardwareResetState (CHIP_HARDWARE_RESET_FIRST_DELAY_1MS)
InitStateMachine()
INIT_WAIT_FOR_CHIP_RESET_STATE
if(m2m_get_elapsed_time(startTime) >= 2)
m2mStub_PinSet_CE(M2M_WIFI_PIN_HIGH)
startTime = m2mStub_GetOneMsTimer();
SetChipHardwareResetState(CHIP_HARDWARE_RESET_SECOND_DELAY_5_MS);
InitStateMachine()
INIT_WAIT_FOR_CHIP_RESET_STATE
m2m_get_elapsed_time(startTime) >= 6
m2mStub_PinSet_RESET(M2M_WIFI_PIN_HIGH)
startTime = m2mStub_GetOneMsTimer();
SetChipHardwareResetState(CHIP_HARDWARE_RESET_FINAL_DELAY);
InitStateMachine()
INIT_WAIT_FOR_CHIP_RESET_STATE
m2m_get_elapsed_time(startTime) >= 10
SetChipHardwareResetState(CHIP_HARDWARE_RESET_COMPLETE)
retVal = true // State machine has completed successfully
g_scanInProgress = false
nm_spi_init();
reg = spi_read_reg(NMI_SPI_PROTOCOL_CONFIG)
Wait for a complete transmission
Wait for transmitter to be ready
SPI_PUT(WINC1500_SPI, *p_txBuf++);
--txLen;
Wait for transfer to finish, stuck on here
Wait for transfer to finish, stuck on here
The ATmega32 is an 8-bit AVR, but the code you are using (avr32_spi_t, AVR32_SPI_SR_TXEMPTY_MASK) is for AVR32, a family of 32-bit AVRs. You are polling registers that do not exist on your part, so the flag can never be set. Consult the ATmega32 datasheet and look at its SPI section, which covers the 8-bit AVR ATmega family.
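For comparison, here is a minimal sketch of a blocking byte transfer using the ATmega32's own SPI registers (SPDR, SPSR, SPIF, as named in the ATmega32 datasheet); configuring master mode via SPCR and setting the SS/MOSI/SCK pin directions is assumed to have been done elsewhere:
#include <avr/io.h>

// Clock one byte out and return the byte clocked in at the same time.
uint8_t spi_transfer(uint8_t data)
{
    SPDR = data;                     // writing SPDR starts the transmission
    while (!(SPSR & (1 << SPIF)))    // wait for the transfer-complete flag
        ;
    return SPDR;                     // reading SPDR after SPSR clears SPIF
}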

socketCAN connection: read() not fast enough

Hello,
I use a socket() connection for my CAN communication:
fd = socket(PF_CAN, SOCK_RAW, CAN_RAW);
I'm using two threads: one periodic 1 ms RT thread to send data and one thread to read the incoming messages. The read function looks like:
void readCan0Socket(void)
{
    int receivedBytes = 0;
    do
    {
        // set GPIO pin low
        receivedBytes = read(fd,
                             &receiveCanFrame[recvBufferWritePosition],
                             sizeof(struct can_frame));
        // reset GPIO pin high
        if (receivedBytes != 0)
        {
            if (receivedBytes == sizeof(struct can_frame))
            {
                recvBufferWritePosition++;
                if (recvBufferWritePosition == CAN_MAX_RECEIVE_BUFFER_LENGTH)
                {
                    recvBufferWritePosition = 0;
                }
            }
            receivedBytes = 0;
        }
    } while (1);
}
The socket is configured in blocking mode, so read() blocks until a message arrives. The current implementation is working, but when I measure the time between reading a message and the next waiting state of the read function (see the set/reset GPIO comments), the time varies between 30 us (the mean value) and more than 200 us. A value greater than 200 us means (CAN at a bit rate of 1 Mbit/s) that frames arriving while read() is still handling the previous message are not recognized. The read() function must be ready again within 134 us.
How can I accelerate my implementation? I tried to use two reader threads separated by a mutex (lock before the read() call, unlock after a message is received), but this didn't solve my problem.
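One direction worth sketching, with recvmmsg() and the batch size as illustrative assumptions rather than anything from the post above: on Linux, recvmmsg() can pull several queued CAN frames out of the socket with a single system call, which cuts the per-frame syscall overhead of the loop above.
#define _GNU_SOURCE
#include <sys/socket.h>
#include <sys/uio.h>
#include <linux/can.h>
#include <string.h>

#define BATCH 16

// Blocks for the first frame, then also returns whatever else is already queued.
// Returns the number of frames received, or -1 on error.
int readCan0Batch(int fd, struct can_frame frames[BATCH])
{
    struct mmsghdr msgs[BATCH];
    struct iovec iovecs[BATCH];
    int i;

    memset(msgs, 0, sizeof(msgs));
    for (i = 0; i < BATCH; i++)
    {
        iovecs[i].iov_base         = &frames[i];
        iovecs[i].iov_len          = sizeof(struct can_frame);
        msgs[i].msg_hdr.msg_iov    = &iovecs[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }
    // MSG_WAITFORONE: wait for the first message, then turn non-blocking.
    return recvmmsg(fd, msgs, BATCH, MSG_WAITFORONE, NULL);
}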

PIC18f4550 : Configuring RB4 for bi-directional data

I am looking to configure my PIC so I can use port RB4 to send out pulses to a device and then receive data on the same port. For this I need to configure RB4 as a digital I/O pin and then:
set as output
low signal
1 ms delay
high signal
1 ms delay
set as input
read input
This code then loops. So I have:
for (i = 0; i < 10; i++)   // There are 10 bits of data to read
{
    ADCON0bits.ADON = 0;
    TRISBbits.TRISB4 = 0;  // set to output
    ADCON0bits.ADON = 1;
    LATBbits.LATB4 = 0;    // output low
    LATBbits.LATB4 = 1;    // output high
    delay(1);
    ADCON0bits.ADON = 0;
    TRISBbits.TRISB4 = 1;  // configure for input
    ADCON0bits.ADON = 1;
    inData = inData << 1;
    delay(1);
    if (PORTBbits.RB4 == 1)
        inData++;
}
But I don't seem to be getting the inputs. I am new to the PIC world. Can anyone point me in the right direction? Is it possible to switch between input and output like this, and am I configuring it correctly?
Many thanks!
I am a bit late to the party.
I recommend using interrupts in the part of the code where you are waiting for data to be received. Polling is generally not a good approach: you end up with a much more complicated implementation than a simple counter in the interrupt service routine.
So you should enable interrupt-on-change for pin 4 of PORTB, as in the sketch below.
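A minimal sketch of that approach for the PIC18F4550 under the XC8 compiler (register and bit names are from the datasheet; the edge-counting logic is only an illustrative assumption):
#include <xc.h>
#include <stdint.h>

volatile uint8_t edgeCount = 0;  // incremented on each rising edge seen on RB4

void setupPortBInterrupt(void)
{
    ADCON1 = 0x0F;           // all channels digital, so RB4 reads as digital I/O
    TRISBbits.TRISB4 = 1;    // RB4 as input
    INTCONbits.RBIF = 0;     // clear any pending port-change flag
    INTCONbits.RBIE = 1;     // enable interrupt-on-change for RB7:RB4
    INTCONbits.GIE = 1;      // enable global interrupts
}

void __interrupt() isr(void)
{
    if (INTCONbits.RBIE && INTCONbits.RBIF)
    {
        uint8_t portB = PORTB;   // reading PORTB ends the mismatch condition
        if (portB & (1u << 4))   // rising edge on RB4
            edgeCount++;
        INTCONbits.RBIF = 0;     // clear the flag only after the port read
    }
}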

Synchronized Block takes more time after instrumenting with ASM

I am trying to instrument Java synchronized blocks using ASM. The problem is that after instrumenting, the execution time of the synchronized block increases, here from 2 msecs to 200 msecs on a Linux box.
I implement this by identifying the MonitorEnter and MonitorExit opcodes.
I instrument at three points: 1. just before the MonitorEnter, 2. just after the MonitorEnter, 3. just before the MonitorExit.
Points 1 and 3 together work fine, but when I do 2, the execution time increases dramatically.
Even if I instrument only a single SOP statement, which is intended to execute just once, I get higher values.
Here is the sample code (prime number, 10 loops):
for (int w = 0; w < 10; w++) {
    synchronized (s) {
        long t1 = System.currentTimeMillis();
        long num = 2000;
        for (long i = 1; i < num; i++) {
            long p = i;
            int j;
            for (j = 2; j < p; j++) {
                long n = p % i;
            }
        }
        long t2 = System.currentTimeMillis();
        System.out.println("Time>>>>>>>>>>>> " + (t2 - t1));
    }
}
Here is the code for the instrumentation (the System.currentTimeMillis() here gives the time at which instrumentation happened; it is not the measure of execution time, which comes from the SOP statement above):
public void visitInsn(int opcode)
{
    switch (opcode)
    {
        // Scenario 1: just before MonitorEnter
        case Opcodes.MONITORENTER:
            visitFieldInsn(Opcodes.GETSTATIC, "java/lang/System", "out", "Ljava/io/PrintStream;");
            visitLdcInsn("TIME Arrive: " + System.currentTimeMillis());
            visitMethodInsn(Opcodes.INVOKEVIRTUAL, "java/io/PrintStream", "println", "(Ljava/lang/String;)V");
            break;
        // Scenario 3: just before MonitorExit
        case Opcodes.MONITOREXIT:
            visitFieldInsn(Opcodes.GETSTATIC, "java/lang/System", "out", "Ljava/io/PrintStream;");
            visitLdcInsn("TIME exit : " + System.currentTimeMillis());
            visitMethodInsn(Opcodes.INVOKEVIRTUAL, "java/io/PrintStream", "println", "(Ljava/lang/String;)V");
            break;
    }
    super.visitInsn(opcode);
    // Scenario 2: just after MonitorEnter
    if (opcode == Opcodes.MONITORENTER)
    {
        visitFieldInsn(Opcodes.GETSTATIC, "java/lang/System", "out", "Ljava/io/PrintStream;");
        visitLdcInsn("TIME enter: " + System.currentTimeMillis());
        visitMethodInsn(Opcodes.INVOKEVIRTUAL, "java/io/PrintStream", "println", "(Ljava/lang/String;)V");
    }
}
I am not able to find the reason why this is happening or how to correct it.
Thanks in advance.
The reason lies in the internals of the JVM that you were using to run the code. I assume that this was a HotSpot JVM, but the answer below is equally valid for most other implementations.
If you trigger the following code:
int result = 0;
for (int i = 0; i < 1000; i++) {
    result += i;
}
This will be translated directly into Java byte code by the Java compiler, but at run time the JVM will easily see that this code is not doing anything: executing it has no effect on the outside (application) world, so why should the JVM execute it? This consideration is exactly what compiler optimization does for you.
If you however trigger the following code:
int result = 0;
for (int i = 0; i < 1000; i++) {
    System.out.println(result);
}
the Java runtime can no longer optimize your code away. The whole loop must always run, since the System.out.println(int) method always does something observable, so your code will run slower.
Now let's look at your example. In your first example, you basically write this code:
synchronized(s) {
    // do nothing useful
}
This entire code block can easily be removed by the Java runtime. This means: there will be no synchronization! In the second example, you are writing this instead:
synchronized(s) {
    long t1 = System.currentTimeMillis();
    // do nothing useful
    long t2 = System.currentTimeMillis();
    System.out.println("Time>>>>>>>>>>>> " + (t2-t1));
}
This means that the effective code might look like this:
synchronized(s) {
    long t1 = System.currentTimeMillis();
    long t2 = System.currentTimeMillis();
    System.out.println("Time>>>>>>>>>>>> " + (t2-t1));
}
What is important here is that this optimized code will still be effectively synchronized, which is an important difference with respect to execution time. Basically, you are measuring the time it costs to synchronize something (and even that might be optimized away after a couple of runs if the JVM realizes that s is not locked elsewhere in your code; buzzword: speculative optimization, with the possibility of deoptimization if code loaded in the future also synchronizes on s).
You should really read these:
http://www.ibm.com/developerworks/java/library/j-jtp02225/
http://www.ibm.com/developerworks/library/j-jtp12214/
Your test, for example, misses a warm-up phase, so you are also measuring how much time the JVM spends on byte-code-to-machine-code compilation.
On a side note: synchronizing on a String is almost always a bad idea. Your strings may or may not be interned, which means you cannot be absolutely sure about their identity. This means that synchronization might or might not work, and you might even inflict synchronization on other parts of your code.