I have a somewhat complex procedure that contains a nested loop and a subgroupBarrier.
In a simplified form it looks like
debugPrintfEXT("Finish! %d", some_variable);
// do some final stuff
return; // this is the only return in the entire procedure
// do some stuff
// do some stuff
Overall the procedure is correct and it does what's expected from it. All subgroup threads always eventually reach the end condition. However, in my logs I see
Finish! 3
And it's not just a matter of logs being displayed out of order. I perform an atomic addition and its result seems to be wrong too. I need all threads to finish all their atomic operations before printing Finish!. If subgroupBarrier() worked the way I expected, it should print 4, but in my case it prints 3. I've been mostly following this tutorial
and it says that
void subgroupBarrier() performs a full memory and execution barrier - basically when an invocation returns from subgroupBarrier() we are guaranteed that every invocation executed the barrier before any return, and all memory writes by those invocations are visible to all invocations in the subgroup.
Interestingly I tried changing if(gl_SubgroupInvocationID.x==0) to other numbers. For example if(gl_SubgroupInvocationID.x==3) yields
Finish! 2
So it seems like the subgroupBarrier() is entirely ignored.
Could the nested loop be the cause of the problem or is it something else?
I provide here more detailed code
#version 450
#extension GL_KHR_shader_subgroup_basic : enable
#extension GL_EXT_debug_printf : enable
layout (local_size_x_id = GROUP_SIZE_CONST_ID) in; // this is a specialization constant whose value always matches the subgroupSize
shared uint copied_faces_idx;
void main() {
const uint chunk_offset = gl_WorkGroupID.x;
const uint lID = gl_LocalInvocationID.x;
// ... Some less important stuff happens here ...
const uint[2] ending = uint[2](relocated_leading_faces_ending, relocated_trailing_faces_ending);
const uint[2] beginning = uint[2](offset_to_relocated_leading_faces, offset_to_relocated_trailing_faces);
uint part = 0;
face_offset = lID;
Face face_to_relocate = faces[face_offset];
debugPrintfEXT("Stop 1: %d %d",gl_SubgroupID,gl_SubgroupInvocationID.x);
subgroupBarrier(); // I added this just to test see what happens
debugPrintfEXT("Stop 2: %d %d",gl_SubgroupID,gl_SubgroupInvocationID.x);
while(face_offset >= ending[part]){
debugPrintfEXT("Stop 3: %d %d",gl_SubgroupID,gl_SubgroupInvocationID.x);
debugPrintfEXT("Stop 4: %d %d",gl_SubgroupID,gl_SubgroupInvocationID.x);
for(uint i=lID;i<inserted_face_count;i+=GROUP_SIZE){
uint offset = atomicAdd(copied_faces_idx,1);
face_to_relocate = faces_to_be_inserted[i];
debugPrintfEXT("Stop 5: %d %d",gl_SubgroupID,gl_SubgroupInvocationID.x);
tmp_faces_copy[offset+1] = face_to_relocate.x;
tmp_faces_copy[offset+2] = face_to_relocate.y;
subgroupBarrier(); // Let's make sure that copied_faces_idx has been incremented by all threads.
debugPrintfEXT("Finish! %d",copied_faces_idx);
face_offset = beginning[part] + lID;
face_to_relocate = faces[face_offset];
remove_face(face_offset, i);
debugPrintfEXT("remove_face: %d %d",gl_SubgroupID,gl_SubgroupInvocationID.x);
face_to_relocate = faces[face_offset];
Basically what this code does is equivalent to
outer1:for(every face X in polygon beginning){
for(every face Y to be removed from polygons){
continue outer1;
outer2:for(every face X in polygon ending){
for(every face Y to be removed from polygons){
continue outer2;
for(every face Z to be inserted in the middle of polygon){
The reason why my code looks so convoluted is because I wrote it in a way that is more parallelizable and tries to minimize the number of inactive threads (considering that usually threads in the same subgroup have to execute the same instruction).
I also added a few more debug prints and one more barrier just to see what happens. Here are the logs that I got:
Stop 1: 0 0
Stop 1: 0 1
Stop 1: 0 2
Stop 1: 0 3
Stop 2: 0 0
Stop 2: 0 1
Stop 2: 0 2
Stop 2: 0 3
Stop 3: 0 2
Stop 3: 0 3
Stop 4: 0 2
Stop 4: 0 3
Stop 5: 0 2
Stop 5: 0 3
remove_face: 0 0
Stop 3: 0 0
Stop 4: 0 0
Stop 5: 0 0
Finish! 3 // at this point value 3 is saved (which is the wrong value)
remove_face: 0 1
Stop 3: 0 1
Stop 4: 0 1
Stop 5: 0 1 // at this point atomic is incremented and becomes 4 (which is the correct value)

I found the reason why my code did not work. So it turns out that I misunderstood how exactly subgroupBarrier() decides which threads to synchronize. If a thread is inactive then it will not participate in the barrier. It doesn't matter whether the inactive thread will later become active and will eventually reach the barrier.
Those two loops are not equivalent (even though it seems like they are)
If all threads reach the end condition in the exact same iteration, then there is no problem, because all threads are active at the same time.
The issue appears when different threads might exit the loop at different iterations. If thread A passes the end condition after 2 iterations and thread B passes the end condition after 3 iterations, then there will be one entire iteration between them when A is inactive and waiting for B to finish.
In the first scenario, A will reach break first, then B will reach break second, and then finally both threads will exit the loop and arrive at the barrier together.
In the second scenario, A will reach the end condition first and execute the if statement, while B will be inactive, waiting for A to finish. When A reaches the barrier, it is the only active thread at that point in time, so it passes through the barrier without synchronizing with B. Then A finishes executing the body of the if statement, reaches return, and becomes inactive. Then B becomes active again and finishes executing its iteration. In the next iteration B reaches the end condition and the barrier, and again it is the only active thread, so the barrier has nothing to synchronize.
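The two scenarios above can be illustrated with a small CPU-side model. This is only a sketch in C++, not GLSL, and it assumes the simplified rule described in this answer: a barrier synchronizes only the invocations that are active when it executes. The function names and the `exit_iter` vector are hypothetical, and real hardware reconvergence may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical CPU-side model (not GLSL). exit_iter[t] is the loop iteration
// in which invocation t meets its end condition. Each function returns how
// many invocations take part in each executed barrier, in execution order.

// Shape 1: `break` on the end condition, barrier placed after the loop.
// Everyone leaves the loop before the barrier, so one barrier sees all of them.
std::vector<std::size_t> barrier_after_loop(const std::vector<int>& exit_iter) {
    assert(!exit_iter.empty());
    return {exit_iter.size()};
}

// Shape 2: barrier inside `if (end_condition) { ...; return; }` in the loop body.
// In iteration k only the invocations with exit_iter == k are active in that branch.
std::vector<std::size_t> barrier_inside_branch(const std::vector<int>& exit_iter) {
    assert(!exit_iter.empty());
    std::map<int, std::size_t> active_in_iteration; // ordered by iteration number
    for (int k : exit_iter) ++active_in_iteration[k];
    std::vector<std::size_t> participants;
    for (const auto& entry : active_in_iteration) participants.push_back(entry.second);
    return participants;
}
```

With exit iterations {2, 3} (threads A and B from the explanation above), the first shape reports a single barrier with 2 participants, while the second reports two separate barriers with 1 participant each.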


Does MPI_Scatter influence MPI_Bcast?

I'm sending an integer that triggers termination via MPI_Bcast. The root sets a variable called "running" to zero and sends the BCast. The Bcast seems to complete but I can't see that the value is sent to the other processes. The other processes seem to be waiting for an MPI_Scatter to complete. They shouldn't even be able to arrive here.
I have done much research on MPI_Bcast and from what I understand it should be blocking. This is confusing me since the MPI_Bcast from the root seems to complete even though I can't find the matching (receiving) MPI_Bcasts for the other processes. I have surrounded all of my MPI_Bcasts with printfs and the output of those printfs 1) print and 2) print the correct values from the root.
The root looks as follows:
while (running || ...) {
/*Do stuff*/
if (...) {
running = 0;
printf("Running = %d and Bcast from root\n", running);
MPI_Bcast(&running, 1, MPI_INT, 0, MPI_COMM_WORLD);
printf("Root 0 Bcast complete. Running %d\n", running);
/* Do some more stuff and eventually reach Finalize */
printf("Root is Finalizing\n");
The other processes have the following code:
while (running) {
printf("Waiting on BCast from root with myRank: %d\n", rank);
MPI_Bcast(&running, 1, MPI_INT, 0, MPI_COMM_WORLD);
printf("P%d received running = %d\n", rank, running);
if (running == 0) { // just to make sure.
I also have the following in the function "doThisFunction()". This is where the processes seem to be waiting for process 0:
int doThisFunction(...) {
    /*Do stuff*/
    printf("P%d waiting on Scatter\n", rank);
    MPI_Scatter(buffer, 130, MPI_BYTE, encoded, 130, MPI_BYTE, 0, MPI_COMM_WORLD);
    printf("P%d done with Scatter\n", rank);
    /*Do stuff*/
    printf("P%d waiting on gather\n", rank);
    MPI_Gather(encoded, 1, MPI_INT, buffer, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("P%d done with gather\n", rank);
    /*Do Stuff*/
    return aValue;
}
The output in the command line looks as follows:
P0 waiting on Scatter
P0 done with Scatter
P0 waiting on gather
P0 done with gather
Waiting on BCast from root with myRank: 1
P1 received running = 1
P1 waiting on Scatter
P0 waiting on Scatter
P0 done with Scatter
P0 waiting on gather
P0 done with gather
P1 done with Scatter
P1 waiting on gather
P1 done with gather
Waiting on BCast from root with myRank: 1
P1 received running = 1
P1 waiting on Scatter
Running = 0 and Bcast from root
Root 0 Bcast complete. Running 0
/* Why does it say the Bcast is complete
/* even though P1 didn't output that it received it?
Root is Finalizing
/* Deadlocked...
I'm expecting that P1 receives running as zero and then goes into MPI_Finalize() but rather it gets stuck at the scatter which will not be accessed by the root which is already trying to finalize.
In actuality, the program is in deadlock and won't terminate MPI.
I doubt that the problem is that the scatter is accepting the Bcast value because this doesn't even make sense since the root doesn't call scatter.
Does anyone please have any tips on how to resolve this problem?
Your help is greatly appreciated.
Why does it say the Bcast is complete even though P1 didn't output that it received it?
Note the following definitions from the MPI Standard:
Collective operations can (but are not required to) complete as soon as the caller's participation in the collective communication is finished. ... The completion of a collective operation indicates that the caller is free to modify locations in the communication buffer. It does not indicate that other processes in the group have completed or even started the operation (unless otherwise implied by the description of the operation). Thus, a collective communication operation may, or may not, have the effect of synchronizing all calling processes. This statement excludes, of course, the barrier operation.
According to this definition, your MPI_Bcast on the root process can finish even if there is no MPI_Bcast called by slaves.
(For point-to-point operations, we have different communication modes, such as the synchronous one, to address these issues. Unfortunately, there is no synchronous mode for collectives.)
There seems to be some problem in your code with the order of operations. The root called MPI_Bcast, but process #1 did not and was waiting on MPI_Scatter as your log output indicates.
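MPI requires every process in a communicator to call collectives in the same order, so one way to locate the problem is to compare the per-rank call sequences. The following is a minimal sketch in plain C++ (no MPI needed); `first_mismatch` and `example_from_log` are hypothetical helpers, and the sequences are read off the log in the question, assuming every collective call was actually printed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Index of the first position at which two ranks' collective-call sequences
// differ, or -1 if both ranks issue the same collectives in the same order.
int first_mismatch(const std::vector<std::string>& a, const std::vector<std::string>& b) {
    const std::size_t common = std::min(a.size(), b.size());
    for (std::size_t i = 0; i < common; ++i) {
        if (a[i] != b[i]) return static_cast<int>(i);
    }
    return a.size() == b.size() ? -1 : static_cast<int>(common);
}

// Call order read off the posted log (assumes every collective was printed).
void example_from_log() {
    const std::vector<std::string> p0 = {"Scatter", "Gather", "Scatter", "Gather", "Bcast"};
    const std::vector<std::string> p1 = {"Bcast", "Scatter", "Gather", "Bcast", "Scatter"};
    assert(first_mismatch(p0, p1) == 0); // the two ranks disagree from the very first call
}
```

If the log is complete, P1 issues an MPI_Bcast on every loop iteration while the root only issues one at termination, so the sequences already diverge at the first call.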

IO Completion ports: separate thread pool to process the dequeued packets?

NOTE: I have added the C++ tag to this because a) the code is C++ and b) people using C++ may well have used IO completion ports. So please don't shout.
I am playing with IO completion ports, and have eventually fully understood (and tested, to prove) - both with help from RbMm - the meaning of the NumberOfConcurrentThreads parameter within CreateIoCompletionPort().
I have the following small program which creates 10 threads all waiting on the completion port. I tell my completion port to only allow 4 threads to be runnable at once (I have four CPUs). I then enqueue 8 packets to the port. My thread function outputs a message if it dequeues a packet with an ID > 4; in order for this message to be output, I have to stop at least one of the four currently running threads, which happens when I enter '1' at the console.
Now this is all fairly simple code. I have one big concern however, and that is that if all of the threads that are processing a completion packet get bogged down, it will mean no more packets can be dequeued and processed. That is what I am simulating with my infinite loop - the fact that no more packets are dequeued until I enter '1' at the console highlights this potential problem!
Would a better solution not be to have my four threads dequeuing packets (or as many threads as CPUs), then when one is dequeued, farm the processing of that packet off to a worker thread from a separate pool, thereby removing the risk of all threads in the IOCP being bogged down thus no more packets being dequeued?
I ask this as all the examples of IO completion port code I have seen use a method similar to what I show below, not using a separate thread pool which I propose. This is what makes me think that I am missing something because I am outnumbered!
Note: this is a somewhat contrived example, because Windows will allow an additional packet to be dequeued if one of the runnable threads enters a wait state; I show this in my code with a commented out cout call:
The system also allows a thread waiting in GetQueuedCompletionStatus
to process a completion packet if another running thread associated
with the same I/O completion port enters a wait state for other
reasons, for example the SuspendThread function. When the thread in
the wait state begins running again, there may be a brief period when
the number of active threads exceeds the concurrency value. However,
the system quickly reduces this number by not allowing any new active
threads until the number of active threads falls below the concurrency value.
But I won't be calling SuspendThread in my thread functions, and I don't know which functions other than cout will cause the thread to enter a wait state, thus I can't predict if one or more of my threads will ever get bogged down! Hence my idea of a thread pool; at least context switching would mean that other packets get a chance to be dequeued!
#include <windows.h>
#include <thread>
#include <vector>
#include <algorithm>
#include <atomic>
#include <ctime>
#include <functional> // mem_fn
#include <iostream>
#include <iterator>   // back_inserter
using namespace std;
int main()
{
    HANDLE hCompletionPort1;
    if ((hCompletionPort1 = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 4)) == NULL)
        return -1;
    vector<thread> vecAllThreads;
    atomic_bool bStop(false);
    // Fill our vector with 10 threads, each of which waits on our IOCP.
    generate_n(back_inserter(vecAllThreads), 10, [hCompletionPort1, &bStop] {
        thread t([hCompletionPort1, &bStop]()
        {
            // Thread body
            while (true)
            {
                DWORD dwBytes = 0;
                ULONG_PTR uKey = 0; // completion key, used below as the packet ID
                LPOVERLAPPED pOverlapped = 0;
                if (::GetQueuedCompletionStatus(hCompletionPort1, &dwBytes, &uKey, &pOverlapped, INFINITE) == 1)
                {
                    if (dwBytes == 0 && uKey == 0 && pOverlapped == 0)
                        break; // Special completion packet; end processing.
                    //cout << uKey; // EVEN THIS WILL CAUSE A "wait" which causes MORE THAN 4 THREADS TO ENTER!
                    if (uKey > 4)
                        cout << "Started processing packet ID > 4!" << endl;
                    while (!bStop)
                        ; // busy-wait: simulates a handler that never enters a wait state
                }
            }
        });
        return move(t);
    });
    // Queue 8 completion packets to our IOCP...only four will be processed until we set our bool
    for (int i = 1; i <= 8; ++i)
        PostQueuedCompletionStatus(hCompletionPort1, 0, i, new OVERLAPPED);
    while (!bStop)
    {
        int nVal;
        cout << "Enter 1 to cause current processing threads to end: ";
        cin >> nVal;
        bStop = (nVal == 1);
    }
    for (int i = 0; i < 10; ++i) // Tell all 10 threads to stop processing on the IOCP
        PostQueuedCompletionStatus(hCompletionPort1, 0, 0, 0); // Special packet marking end of IOCP usage
    for_each(begin(vecAllThreads), end(vecAllThreads), mem_fn(&thread::join));
    return 0;
}
What I mean by "separate thread pool" is something like the following:
class myThread {
public:
    void SetTask(LPOVERLAPPED pO) { /* start processing pO*/ }
private:
    thread m_thread; // Actual thread object
};
// The threads in this thread pool are not associated with the IOCP in any way whatsoever; they exist
// purely to be handed a completion packet which they then process!
class ThreadPool
{
public:
    void Initialise() { /* create 100 worker threads and add them to some internal storage*/}
    myThread& GetNextFreeThread() { /* return one of the 100 worker thread we created*/}
} g_threadPool;
The code for each of my four threads associated with the IOCP then changes to
if (::GetQueuedCompletionStatus(hCompletionPort1, &dwBytes, &uKey, &pOverlapped, INFINITE) == 1)
{
    if (dwBytes == 0 && uKey == 0 && pOverlapped == 0)
        break; // Special completion packet; end processing.
    // Pick a new thread from a pool of pre-created threads and assign it the packet to process
    myThread& thr = g_threadPool.GetNextFreeThread();
    thr.SetTask(pOverlapped);
    // Now, this thread can immediately return to the IOCP; it doesn't matter if the
    // packet we dequeued would take forever to process; that is happening in the
    // separate thread thr *that will not interfere with packets being dequeued from IOCP!*
}
This way, there is no possible way that I can end up in the situation where no more packets are being dequeued!
It seems there is conflicting opinion on whether a separate thread pool should be used. Clearly, as the sample code I have posted shows, there is potential for packets to stop being dequeued from the IOCP if the processing of the packets does not enter a wait state; granted, the infinite loop is perhaps unrealistic, but it does demonstrate the point.
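The hand-off idea can be sketched in portable C++ without any Windows calls. This is a minimal illustration, not the poster's ThreadPool: the `WorkerPool` class and its `submit` method are hypothetical names, and the pool simply runs queued tasks on a fixed number of `std::thread` workers.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical worker pool; its threads are not associated with any IOCP.
class WorkerPool {
public:
    explicit WorkerPool(std::size_t worker_count) {
        assert(worker_count > 0);
        for (std::size_t i = 0; i < worker_count; ++i)
            workers_.emplace_back([this] { run(); });
    }
    // Finishes all queued tasks, then joins the workers.
    ~WorkerPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (std::thread& w : workers_) w.join();
    }
    // Meant to be called by a dequeuing thread; only enqueues, so it returns quickly.
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        wake_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return; // stopping and nothing left to do
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task(); // run outside the lock so other workers can dequeue
        }
    }
    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::vector<std::thread> workers_;
};
```

An IOCP thread would call something like `pool.submit(...)` with a task that processes the dequeued packet right after GetQueuedCompletionStatus returns, and then go straight back to waiting on the port. Note that in this sketch a task that never finishes still occupies a worker, so if every worker is stuck the packets simply wait in the pool's queue instead of in the IOCP.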

OpenCL possible reason a clGetEventInfo would cause a segfault?

I have a pretty complicated OpenCL app. It fires up 5 different contexts on 5 different GPUs, and executes the same kernel on all of them, splitting up the work into 1024 "chunks" to be processed.
Each time a kernel finishes, a result is checked for, and it's given a new chunk. Sometimes, when running, as the app is starting (very rarely mid-run) it will immediately segfault on the GetEventInfo call.
This is done in a loop using callbacks and clGetEventInfo calls to ensure something is finished before moving on to the next step.
GDB output:
(gdb) back
#0 0x00007fdc686ab525 in clGetEventInfo () from /usr/lib/
#1 0x00000000004018c1 in ready (event=0x26a00000267) at gputest.c:165
#2 0x0000000000404b5a in main (argc=9, argv=0x7fffdfe3b268) at gputest.c:544
The ready function:
int ready(cl_event event) {
    cl_int rdy;
    if(!event) // guard condition missing from the listing; a NULL check is assumed here
        return 0;
    clGetEventInfo(event, CL_EVENT_COMMAND_EXECUTION_STATUS, sizeof(cl_int), &rdy, NULL);
    if(rdy == CL_COMPLETE)
        return 1;
    return 0;
}
How the kernel is run, the event set, and checked. Some pseudocode inserted for brevity:
while(test if loop is complete) {
    for(j = 0; j < GPUS; j++) {
        if(gpu[j].waiting && loops < 9999) {
            gpu[j].waiting = 0;
            offset[j] = loops * 1024 * 1024;
            EC("kernel init", clEnqueueNDRangeKernel(queues[j], kernel_init[j], 1, &(offset[j]), &global_work_size, &work128, 0, NULL, &events[j]));
            gpu[j].readsearch = events[j];
            gpu[j].reading = 1;
        }
    }
    for(j = 0; j < GPUS; j++) {
        if(gpu[j].reading && ready(gpu[j].readsearch)) {
            gpu[j].reading = 0;
            gpu[j].waiting = 1;
            // unrelated reporting other code here
        }
    }
}
It's pretty simple. There is more to the code, but it's unrelated. The ready/checking function is very simple. I even added debugging to the ready function to printf the event # to see what was happening when it crashed, and found nothing really. No pattern I could see.
What could be causing this?
Ugh. Found the problem. Since you cannot initialize values when you create/declare a struct, I was using some values uninitialized. I malloc'ed the gpu structs and then just started using them, so gpu[x].reading in if(gpu[x].reading && ...) was random, completely uninitialized data. Sometimes it was non-zero, which allowed the ready() function to fire off. Since the gpu[x].readsearch event was never set in the first place, clGetEventInfo bombed trying to use whatever was at that memory location.
This would be time number 482,847 that accidentally using uninitialized variables has burned me.
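One way to rule out this class of bug is to give every field a defined starting value. The sketch below is in C++ and builds without OpenCL: `EventHandle` is a stand-in for cl_event, the field names follow the post, and the default values plus the helpers `make_gpus`, `may_poll` and `example_startup_state` are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for cl_event so the sketch builds without OpenCL headers (assumption).
using EventHandle = void*;

// Field names follow the post; the default values are assumptions.
struct GpuState {
    int waiting = 1;                  // idle, may be handed its first chunk
    int reading = 0;                  // no kernel in flight yet
    EventHandle readsearch = nullptr; // no event until a kernel is enqueued
};

// Every element gets the defaults above, so no field starts out as garbage.
std::vector<GpuState> make_gpus(std::size_t count) {
    return std::vector<GpuState>(count);
}

// Only a GPU with a kernel in flight and a real event may be polled.
bool may_poll(const GpuState& gpu) {
    return gpu.reading != 0 && gpu.readsearch != nullptr;
}

void example_startup_state() {
    std::vector<GpuState> gpu = make_gpus(5);
    for (const GpuState& g : gpu) assert(!may_poll(g)); // nothing to poll at startup
}
```

In plain C, allocating with calloc instead of malloc gives an all-zero starting state; any field that should start non-zero (such as waiting in this sketch) would then have to be set explicitly.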

Synchronized Block takes more time after instrumenting with ASM

I am trying to instrument a Java synchronized block using ASM. The problem is that after instrumenting, the synchronized block takes much longer to execute: the time increases from 2 ms to 200 ms on a Linux box.
I implement this by identifying the MonitorEnter and MonitorExit opcodes.
I instrument at three points: (1) just before MonitorEnter, (2) just after MonitorEnter, and (3) just before MonitorExit.
1 and 3 together work fine, but when I do 2, the execution time increases dramatically.
Even if we instrument just one more SOP (System.out.println) statement, which is intended to be executed only once, it gives higher values.
Here is the sample code (prime number, 10 loops):
for(int w=0;w<10;w++){
    synchronized(s){
        long t1 = System.currentTimeMillis();
        long num = 2000;
        for (long i = 1; i < num; i++) {
            long p = i;
            int j;
            for (j = 2; j < p; j++) {
                long n = p % i;
            }
        }
        long t2 = System.currentTimeMillis();
        System.out.println("Time>>>>>>>>>>>> " + (t2-t1) );
    }
}
Here is the code for the instrumentation (here System.currentTimeMillis() gives the time at which the instrumentation happened; it is not a measure of execution time. The execution time comes from the SOP statement above):
public void visitInsn(int opcode)
// Scenario 1
case 194:
visitFieldInsn(Opcodes.GETSTATIC, "java/lang/System", "out", "Ljava/io/PrintStream;");
visitLdcInsn("TIME Arrive: "+System.currentTimeMillis());
visitMethodInsn(Opcodes.INVOKEVIRTUAL, "java/io/PrintStream", "println", "(Ljava/lang/String;)V");
// scenario 3
case 195:
visitFieldInsn(Opcodes.GETSTATIC, "java/lang/System", "out", "Ljava/io/PrintStream;");
visitLdcInsn("TIME exit : "+System.currentTimeMillis());
visitMethodInsn(Opcodes.INVOKEVIRTUAL, "java/io/PrintStream", "println", "(Ljava/lang/String;)V");
// scenario 2
visitFieldInsn(Opcodes.GETSTATIC, "java/lang/System", "out", "Ljava/io/PrintStream;");
visitLdcInsn("TIME enter: "+System.currentTimeMillis());
visitMethodInsn(Opcodes.INVOKEVIRTUAL, "java/io/PrintStream", "println", "(Ljava/lang/String;)V");
I am not able to find the reason why this is happening or how to correct it.
Thanks in advance.
The reason lies in the internals of the JVM that you were using for running the code. I assume that this was a HotSpot JVM but the answers below are equally right for most other implementations.
If you trigger the following code:
int result = 0;
for(int i = 0; i < 1000; i++) {
    result += i;
}
This will be translated directly into Java byte code by the Java compiler but at run time the JVM will easily see that this code is not doing anything. Executing this code will have no effect on the outside (application) world, so why should the JVM execute it? This consideration is exactly what compiler optimization does for you.
If you however trigger the following code:
int result = 0;
for(int i = 0; i < 1000; i++) {
    result += i;
    System.out.println(result);
}
the Java runtime cannot optimize away your code anymore. The whole loop must always run since the System.out.println(int) method is always doing something real such that your code will run slower.
Now let's look at your example. In your first example, you basically write this code:
synchronized(s) {
    // do nothing useful
}
This entire code block can easily be removed by the Java run time. This means: There will be no synchronization! In the second example, you are writing this instead:
synchronized(s) {
    long t1 = System.currentTimeMillis();
    // do nothing useful
    long t2 = System.currentTimeMillis();
    System.out.println("Time>>>>>>>>>>>> " + (t2-t1));
}
This means that the effective code might look like this:
synchronized(s) {
    long t1 = System.currentTimeMillis();
    long t2 = System.currentTimeMillis();
    System.out.println("Time>>>>>>>>>>>> " + (t2-t1));
}
What is important here is that this optimized code is still effectively synchronized, which makes an important difference with respect to execution time. Basically, you are measuring the time it costs to synchronize something (and even that might be optimized away after a couple of runs if the JVM realizes that s is not locked elsewhere in your code; buzzword: temporary optimization with the possibility of deoptimization if code loaded in the future also synchronizes on s).
You should really read this:
Your test for example misses a warm-up, such that you are also measuring how much time the JVM will use for byte code to machine code optimization.
On a side note: Synchronizing on a String is almost always a bad idea. Your strings might or might not be interned, which means that you cannot be absolutely sure about their identity. This means that synchronization might or might not work, and you might even inflict synchronization on other parts of your code.

Scala stateful actor, recursive calling faster than using vars?

Sample code below. I'm a little curious why MyActor is faster than MyActor2. MyActor recursively calls process/react and keeps state in the function parameters whereas MyActor2 keeps state in vars. MyActor even has the extra overhead of tupling the state but still runs faster. I'm wondering if there is a good explanation for this or if maybe I'm doing something "wrong".
I realize the performance difference is not significant but the fact that it is there and consistent makes me curious what's going on here.
Ignoring the first two runs as warmup, I get:
import scala.actors._
object Const {
  val NUM = 100000
  val NM1 = NUM - 1
}
trait Send[MessageType] {
  def send(msg: MessageType)
}
// Test 1 using recursive calls to maintain state
abstract class StatefulTypedActor[MessageType, StateType](val initialState: StateType) extends Actor with Send[MessageType] {
  def process(state: StateType, message: MessageType): StateType
  def act = proc(initialState)
  def send(message: MessageType) = {
    this ! message
  }
  private def proc(state: StateType) {
    react {
      case msg: MessageType => proc(process(state, msg))
    }
  }
}
object MyActor extends StatefulTypedActor[Int, (Int, Long)]((0, 0)) {
  override def process(state: (Int, Long), input: Int) = input match {
    case 0 =>
      (1, System.currentTimeMillis())
    case input: Int =>
      state match {
        case (Const.NM1, start) =>
          println((System.currentTimeMillis() - start))
          (Const.NUM, start)
        case (s, start) =>
          (s + 1, start)
      }
  }
}
// Test 2 using vars to maintain state
object MyActor2 extends Actor with Send[Int] {
  private var state = 0
  private var strt = 0: Long
  def send(message: Int) = {
    this ! message
  }
  def act =
    loop {
      react {
        case 0 =>
          state = 1
          strt = System.currentTimeMillis()
        case input: Int =>
          state match {
            case Const.NM1 =>
              println((System.currentTimeMillis() - strt))
              state += 1
            case s =>
              state += 1
          }
      }
    }
}
// main: Run testing
object TestActors {
def main(args: Array[String]): Unit = {
val a = MyActor
// val a = MyActor2
def testIt(a: Send[Int]) {
for (_ <- 0 to 5) {
for (i <- 0 to Const.NUM) {
a send i
EDIT: Based on Vasil's response, I removed the loop and tried it again. And then MyActor2 based on vars leapfrogged and now might be around 10% or so faster. So... lesson is: if you are confident that you won't end up with a stack overflowing backlog of messages, and you care to squeeze every little performance out... don't use loop and just call the act() method recursively.
Change for MyActor2:
override def act() =
react {
case 0 =>
state = 1
strt = System.currentTimeMillis()
case input: Int =>
state match {
case Const.NM1 =>
println((System.currentTimeMillis() - strt))
state += 1
case s =>
state += 1
Such results are caused by the specifics of your benchmark (a lot of small messages that fill the actor's mailbox quicker than it can handle them).
Generally, the workflow of react is as follows:
Actor scans the mailbox;
If it finds a message, it schedules the execution;
When the scheduling completes, or when there are no messages in the mailbox, the actor suspends (Actor.suspendException is thrown);
In the first case, when the handler finishes processing the message, execution proceeds straight to the react method, and, as long as there are lots of messages in the mailbox, the actor immediately schedules the next message to execute, and only after that suspends.
In the second case, loop schedules the execution of react in order to prevent a stack overflow (which might be your case with Actor #1, because tail recursion in process is not optimized), and thus execution doesn't proceed to react immediately, as it does in the first case. That's where the millis are lost.
UPDATE (taken from here):
Using loop instead of recursive react
effectively doubles the number of
tasks that the thread pool has to
execute in order to accomplish the
same amount of work, which in turn
makes it so any overhead in the
scheduler is far more pronounced when
using loop.
Just a wild stab in the dark. It might be due to the exception thrown by react in order to evacuate the loop. Exception creation is quite heavy. I don't know how often it does that, but that should be possible to check with a catch and a counter.
The overhead on your test depends heavily on the number of threads that are present (try using only one thread with scala -Dactors.corePoolSize=1!). I'm finding it difficult to figure out exactly where the difference arises; the only real difference is that in one case you use loop and in the other you do not. Loop does do a fair bit of work, since it repeatedly creates function objects using "andThen" rather than iterating. I'm not sure whether this is enough to explain the difference, especially in light of the heavy usage by scala.actors.Scheduler$.impl and ExceptionBlob.
