Why atomic.Load not called to load gcphase in runtime.GC?

Why atomic.Load not called to load gcphase in runtime.GC? - go

I wonder why gcphase is not protected with atomic.Load:
n := atomic.Load(&work.cycles)
if gcphase == _GCmark {
// Wait until sweep termination, mark, and mark
// termination of cycle N complete.
gp.schedlink = work.sweepWaiters.head
work.sweepWaiters.head.set(gp)
goparkunlock(&work.sweepWaiters.lock, "wait for GC cycle", traceEvGoBlock, 1)
} else {
// We're in sweep N already.
unlock(&work.sweepWaiters.lock)
}
Anyone knows?

Code excerpt:
func setGCPhase(x uint32) {
atomic.Store(&gcphase, x)
writeBarrier.needed = gcphase == _GCmark || gcphase == _GCmarktermination
writeBarrier.enabled = writeBarrier.needed || writeBarrier.cgo
}
while gcphase is a global variable, but all writes to gcphase are done through the above function.
There were several variables in runtime which aren't paired properly, but seems they have reason for it and were sure they were having the exclusive access to it.
Here is the issue https://github.com/golang/go/issues/21931 filed about the same and here https://go-review.googlesource.com/c/go/+/65210 GC developers had some discussions on changing the same.

Related

rw_semaphore's negative count value

I am debugging a kernel crash dump. There seems to be a problem with one process was trying to memory map a new region. The problem is that it was not able to hold the memory map semaphore.
When I looked into process's mm_struct and printed its content. I saw that the struct rw_semaphore mmap_sem were as seen below. Now, does he value of count seem suspicious? It has a negative value, as if there was a race condition where it was decremented twice by two different threads after checking for zero.
mmap_sem = {
count = -4294967295,
wait_lock = {
{
rlock = {
raw_lock = {
slock = 262148
}
}
}
},
wait_list = {
next = 0xffff8801f0113e48,
prev = 0xffff8801f0113e48
}
},

Sorry for the confusion. I thought crash pulls the correct data types and uses that properly when printing out the all the values ...
Looks like crash utility is not read the count member as an int ....
When I print it as int, I get the correct value.
crash> p (int) (((struct mm_struct *) 0xffff8801f15fa540)->mmap_sem).count
$13 = 1

OpenCL possible reason a clGetEventInfo would cause a segfault?

I have a pretty complicated OpenCL app. It fires up 5 different contexts on 5 different GPUs, and executes the same kernel on all of them, splitting up the work into 1024 "chunks" to be processed.
Each time a kernel finishes, a result is checked for, and it's given a new chunk. Sometimes, when running, as the app is starting (very rarely mid-run) it will immediately segfault on the GetEventInfo call.
This is done in a loop using callbacks and clGetEventInfo calls to ensure something is finished before moving on to the next step.
GDB output:
(gdb) back
#0 0x00007fdc686ab525 in clGetEventInfo () from /usr/lib/libOpenCL.so.1
#1 0x00000000004018c1 in ready (event=0x26a00000267) at gputest.c:165
#2 0x0000000000404b5a in main (argc=9, argv=0x7fffdfe3b268) at gputest.c:544
The ready function:
int ready(cl_event event) {
int rdy;
if(!event)
return 0;
clGetEventInfo(event, CL_EVENT_COMMAND_EXECUTION_STATUS, sizeof(cl_int), &rdy, NULL);
if(rdy == CL_COMPLETE)
return 1;
return 0;
}
How the kernel is run, the event set, and checked. Some pseudocode inserted for brevity:
while(test if loop is complete) {
for(j = 0; j < GPUS; j++) {
if(gpu[j].waiting && loops < 9999) {
gpu[j].waiting = 0;
offset[j] = loops * 1024 * 1024;
loops++;
EC("kernel init", clEnqueueNDRangeKernel(queues[j], kernel_init[j], 1, &(offset[j]), &global_work_size, &work128, 0, NULL, &events[j]));
gpu[j].readsearch = events[j];
gpu[j].reading = 1;
}
}
for(j = 0; j < GPUS; j++) {
if(gpu[j].reading && ready(gpu[j].readsearch)) {
gpu[j].reading = 0;
gpu[j].waiting = 1;
// unrelated reporting other code here
}
}
}
Its pretty simple. There is more to the code, but it's unrelated. The ready/checking function is very simple. I even added debugging to the ready function to printf the event # to see what was happening when it crashed - nothing really. No pattern I could see.
What could be causing this?

Ugh. Found the problem. Since you cannot initialize values when you create/declare a struct, I was using some values uninitialized. I malloc'ed the gpu structs then just started using them. With if(gpu[x].reading &&...) being random data and completely uninitialized. So sometimes it was non-zero, which allowed the ready() function to fire off. Since the gpu[x].readsearch event was never set in the first place, clGetEventInfo bombed trying to use whatever was at the memory location.
This would be time number 482,847 that accidentally using uninitialized variables has burned me.

Using par map to increase performance

Below code runs a comparison of users and writes to file. I've removed some code to make it as concise as possible but speed is an issue also in this code :
import scala.collection.JavaConversions._
object writedata {
def getDistance(str1: String, str2: String) = {
val zipped = str1.zip(str2)
val numberOfEqualSequences = zipped.count(_ == ('1', '1')) * 2
val p = zipped.count(_ == ('1', '1')).toFloat * 2
val q = zipped.count(_ == ('1', '0')).toFloat * 2
val r = zipped.count(_ == ('0', '1')).toFloat * 2
val s = zipped.count(_ == ('0', '0')).toFloat * 2
(q + r) / (p + q + r)
} //> getDistance: (str1: String, str2: String)Float
case class UserObj(id: String, nCoordinate: String)
val userList = new java.util.ArrayList[UserObj] //> userList : java.util.ArrayList[writedata.UserObj] = []
for (a <- 1 to 100) {
userList.add(new UserObj("2", "101010"))
}
def using[A <: { def close(): Unit }, B](param: A)(f: A => B): B =
try { f(param) } finally { param.close() } //> using: [A <: AnyRef{def close(): Unit}, B](param: A)(f: A => B)B
def appendToFile(fileName: String, textData: String) =
using(new java.io.FileWriter(fileName, true)) {
fileWriter =>
using(new java.io.PrintWriter(fileWriter)) {
printWriter => printWriter.println(textData)
}
} //> appendToFile: (fileName: String, textData: String)Unit
var counter = 0; //> counter : Int = 0
for (xUser <- userList.par) {
userList.par.map(yUser => {
if (!xUser.id.isEmpty && !yUser.id.isEmpty)
synchronized {
appendToFile("c:\\data-files\\test.txt", getDistance(xUser.nCoordinate , yUser.nCoordinate).toString)
}
})
}
}
The above code was previously an imperative solution, so the .par functionality was within an inner and outer loop. I'm attempting to convert it to a more functional implementation while also taking advantage of Scala's parallel collections framework.
In this example the data set size is 10 but in the code im working on
the size is 8000 which translates to 64'000'000 comparisons. I'm
using a synchronized block so that multiple threads are not writing
to same file at same time. A performance improvment im considering
is populating a separate collection within the inner loop ( userList.par.map(yUser => {)
and then writing that collection out to seperate file.
Are there other methods I can use to improve performance. So that I can
handle a List that contains 8000 items instead of above example of 100 ?

I'm not sure if you removed too much code for clarity, but from what I can see, there is absolutely nothing that can run in parallel since the only thing you are doing is writing to a file.
EDIT:
One thing that you should do is to move the getDistance(...) computation before the synchronized call to appendToFile, otherwise your parallelized code ends up being sequential.
Instead of calling a synchronized appendToFile, I would call appendToFile in a non-synchronized way, but have each call to that method add the new line to some synchronized queue. Then I would have another thread that flushes that queue to disk periodically. But then you would also need to add something to make sure that the queue is also flushed when all computations are done. So that could get complicated...
Alternatively, you could also keep your code and simply drop the synchronization around the call to appendToFile. It seems that println itself is synchronized. However, that would be risky since println is not officially synchronized and it could change in future versions.

Scala stateful actor, recursive calling faster than using vars?

Sample code below. I'm a little curious why MyActor is faster than MyActor2. MyActor recursively calls process/react and keeps state in the function parameters whereas MyActor2 keeps state in vars. MyActor even has the extra overhead of tupling the state but still runs faster. I'm wondering if there is a good explanation for this or if maybe I'm doing something "wrong".
I realize the performance difference is not significant but the fact that it is there and consistent makes me curious what's going on here.
Ignoring the first two runs as warmup, I get:
MyActor:
559
511
544
529
vs.
MyActor2:
647
613
654
610
import scala.actors._
object Const {
val NUM = 100000
val NM1 = NUM - 1
}
trait Send[MessageType] {
def send(msg: MessageType)
}
// Test 1 using recursive calls to maintain state
abstract class StatefulTypedActor[MessageType, StateType](val initialState: StateType) extends Actor with Send[MessageType] {
def process(state: StateType, message: MessageType): StateType
def act = proc(initialState)
def send(message: MessageType) = {
this ! message
}
private def proc(state: StateType) {
react {
case msg: MessageType => proc(process(state, msg))
}
}
}
object MyActor extends StatefulTypedActor[Int, (Int, Long)]((0, 0)) {
override def process(state: (Int, Long), input: Int) = input match {
case 0 =>
(1, System.currentTimeMillis())
case input: Int =>
state match {
case (Const.NM1, start) =>
println((System.currentTimeMillis() - start))
(Const.NUM, start)
case (s, start) =>
(s + 1, start)
}
}
}
// Test 2 using vars to maintain state
object MyActor2 extends Actor with Send[Int] {
private var state = 0
private var strt = 0: Long
def send(message: Int) = {
this ! message
}
def act =
loop {
react {
case 0 =>
state = 1
strt = System.currentTimeMillis()
case input: Int =>
state match {
case Const.NM1 =>
println((System.currentTimeMillis() - strt))
state += 1
case s =>
state += 1
}
}
}
}
// main: Run testing
object TestActors {
def main(args: Array[String]): Unit = {
val a = MyActor
// val a = MyActor2
a.start()
testIt(a)
}
def testIt(a: Send[Int]) {
for (_ <- 0 to 5) {
for (i <- 0 to Const.NUM) {
a send i
}
}
}
}
EDIT: Based on Vasil's response, I removed the loop and tried it again. And then MyActor2 based on vars leapfrogged and now might be around 10% or so faster. So... lesson is: if you are confident that you won't end up with a stack overflowing backlog of messages, and you care to squeeze every little performance out... don't use loop and just call the act() method recursively.
Change for MyActor2:
override def act() =
react {
case 0 =>
state = 1
strt = System.currentTimeMillis()
act()
case input: Int =>
state match {
case Const.NM1 =>
println((System.currentTimeMillis() - strt))
state += 1
case s =>
state += 1
}
act()
}

Such results are caused with the specifics of your benchmark (a lot of small messages that fill the actor's mailbox quicker than it can handle them).
Generally, the workflow of react is following:
Actor scans the mailbox;
If it finds a message, it schedules the execution;
When the scheduling completes, or, when there're no messages in the mailbox, actor suspends (Actor.suspendException is thrown);
In the first case, when the handler finishes to process the message, execution proceeds straight to react method, and, as long as there're lots of messages in the mailbox, actor immediately schedules the next message to execute, and only after that suspends.
In the second case, loop schedules the execution of react in order to prevent a stack overflow (which might be your case with Actor #1, because tail recursion in process is not optimized), and thus, execution doesn't proceed to react immediately, as in the first case. That's where the millis are lost.
UPDATE (taken from here):
Using loop instead of recursive react
effectively doubles the number of
tasks that the thread pool has to
execute in order to accomplish the
same amount of work, which in turn
makes it so any overhead in the
scheduler is far more pronounced when
using loop.

Just a wild stab in the dark. It might be due to the exception thrown by react in order to evacuate the loop. Exception creation is quite heavy. However I don't know how often it do that, but that should be possible to check with a catch and a counter.

The overhead on your test depends heavily on the number of threads that are present (try using only one thread with scala -Dactors.corePoolSize=1!). I'm finding it difficult to figure out exactly where the difference arises; the only real difference is that in one case you use loop and in the other you do not. Loop does do fair bit of work, since it repeatedly creates function objects using "andThen" rather than iterating. I'm not sure whether this is enough to explain the difference, especially in light of the heavy usage by scala.actors.Scheduler$.impl and ExceptionBlob.

Redundant code constructs

The most egregiously redundant code construct I often see involves using the code sequence
if (condition)
return true;
else
return false;
instead of simply writing
return (condition);
I've seen this beginner error in all sorts of languages: from Pascal and C to PHP and Java. What other such constructs would you flag in a code review?

if (foo == true)
{
do stuff
}
I keep telling the developer that does that that it should be
if ((foo == true) == true)
{
do stuff
}
but he hasn't gotten the hint yet.

if (condition == true)
{
...
}
instead of
if (condition)
{
...
}
Edit:
or even worse and turning around the conditional test:
if (condition == false)
{
...
}
which is easily read as
if (condition) then ...

Using comments instead of source control:
-Commenting out or renaming functions instead of deleting them and trusting that source control can get them back for you if needed.
-Adding comments like "RWF Change" instead of just making the change and letting source control assign the blame.

Somewhere I’ve spotted this thing, which I find to be the pinnacle of boolean redundancy:
return (test == 1)? ((test == 0) ? 0 : 1) : ((test == 0) ? 0 : 1);
:-)

Redundant code is not in itself an error. But if you're really trying to save every character
return (condition);
is redundant too. You can write:
return condition;

Declaring separately from assignment in languages other than C:
int foo;
foo = GetFoo();

Returning uselessly at the end:
// stuff
return;
}

I once had a guy who repeatedly did this:
bool a;
bool b;
...
if (a == true)
b = true;
else
b = false;

void myfunction() {
if(condition) {
// Do some stuff
if(othercond) {
// Do more stuff
}
}
}
instead of
void myfunction() {
if(!condition)
return;
// Do some stuff
if(!othercond)
return;
// Do more stuff
}

Using .tostring on a string

Putting an exit statement as first statement in a function to disable the execution of that function, instead of one of the following options:
Completely removing the function
Commenting the function body
Keeping the function but deleting all the code
Using the exit as first statement makes it very hard to spot, you can easily read over it.

Fear of null (this also can lead to serious problems):
if (name != null)
person.Name = name;
Redundant if's (not using else):
if (!IsPostback)
{
// do something
}
if (IsPostback)
{
// do something else
}
Redundant checks (Split never returns null):
string[] words = sentence.Split(' ');
if (words != null)
More on checks (the second check is redundant if you are going to loop)
if (myArray != null && myArray.Length > 0)
foreach (string s in myArray)
And my favorite for ASP.NET: Scattered DataBinds all over the code in order to make the page render.

Copy paste redundancy:
if (x > 0)
{
// a lot of code to calculate z
y = x + z;
}
else
{
// a lot of code to calculate z
y = x - z;
}
instead of
if (x > 0)
y = x + CalcZ(x);
else
y = x - CalcZ(x);
or even better (or more obfuscated)
y = x + (x > 0 ? 1 : -1) * CalcZ(x)

Allocating elements on the heap instead of the stack.
{
char buff = malloc(1024);
/* ... */
free(buff);
}
instead of
{
char buff[1024];
/* ... */
}
or
{
struct foo *x = (struct foo *)malloc(sizeof(struct foo));
x->a = ...;
bar(x);
free(x);
}
instead of
{
struct foo x;
x.a = ...;
bar(&x);
}

The most common redundant code construct I see is code that is never called from anywhere in the program.
The other is design patterns used where there is no point in using them. For example, writing "new BobFactory().createBob()" everywhere, instead of just writing "new Bob()".
Deleting unused and unnecessary code can massively improve the quality of the system and the team's ability to maintain it. The benefits are often startling to teams who have never considered deleting unnecessary code from their system. I once performed a code review by sitting with a team and deleting over half the code in their project without changing the functionality of their system. I thought they'd be offended but they frequently asked me back for design advice and feedback after that.

I often run into the following:
function foo() {
if ( something ) {
return;
} else {
do_something();
}
}
But it doesn't help telling them that the else is useless here. It has to be either
function foo() {
if ( something ) {
return;
}
do_something();
}
or - depending on the length of checks that are done before do_something():
function foo() {
if ( !something ) {
do_something();
}
}

From nightmarish code reviews.....
char s[100];
followed by
memset(s,0,100);
followed by
s[strlen(s)] = 0;
with lots of nasty
if (strcmp(s, "1") == 0)
littered about the code.

Using an array when you want set behavior. You need to check everything to make sure its not in the array before you insert it, which makes your code longer and slower.

Redundant .ToString() invocations:
const int foo = 5;
Console.WriteLine("Number of Items: " + foo.ToString());
Unnecessary string formatting:
const int foo = 5;
Console.WriteLine("Number of Items: {0}", foo);

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Why atomic.Load not called to load gcphase in runtime.GC? - go

Related

rw_semaphore's negative count value

OpenCL possible reason a clGetEventInfo would cause a segfault?

Using par map to increase performance

Scala stateful actor, recursive calling faster than using vars?

Redundant code constructs

Categories

Resources