Producer / Consumer - Producer using high CPU - thread-safety

I have a consumer that is part of a producer/consumer pattern.
Simplified:
public class MessageFileLogger : ILogger
{
    private BlockingCollection<ILogItem> _messageQueue;
    private Thread _worker;
    private bool _enabled = false;

    public MessageFileLogger()
    {
        _worker = new Thread(LogMessage);
        _worker.IsBackground = true;
        _worker.Start();
    }

    private void LogMessage()
    {
        while (_enabled)
        {
            if (_messageQueue.Count > 0)
            {
                var itm = _messageQueue.Take();
                processItem(itm);
            }
            else
            {
                Thread.Sleep(1000);
            }
        }
    }
}
If I remove the
Thread.Sleep(1000);
the CPU usage climbs to something extremely high (13%), as opposed to 0% when the thread is put to sleep.
Also, if I instantiate multiple instances of the class, the CPU usage climbs in 13% increments with each instance.
A new LogItem is added to the BlockingCollection about every minute or so (maybe every 30 seconds), and an applicable message is written to a file.
Is it possible that the thread is somehow blocking other threads from running, and the system somehow needs to compensate?
Update:
Updated code to better reflect actual code

You gave the thread code to run, so by default it runs that code (the while loop) as fast as it possibly can on a single logical core. Since that's about 13%, I'd imagine your CPU has 4 hyperthreaded cores, resulting in 8 logical cores (1/8 = 12.5%). Each additional thread runs its while loop as fast as it possibly can on its own core, resulting in another 13% usage. Pretty straightforward really.
Side effects of not using sleep are that the whole system runs slower, drains SIGNIFICANTLY more battery, and produces SIGNIFICANTLY more heat.
Generally, the proper way is to give the _messageQueue another method like
bool BlockingCollection::TryTake(type& item, std::chrono::milliseconds time)
{
    // Wait until an item is signalled or the timeout expires
    DWORD Ret = WaitForSingleObject(event, static_cast<DWORD>(time.count()));
    if (Ret != WAIT_OBJECT_0)
        return false;
    item = Take(); //might need to use a shared function instead of calling direct
    return true;
}
Then your loop is easy:
private void LogMessage()
{
    type item;
    while (_enabled)
    {
        //your original code makes little sense, but this is roughly the same
        if (_messageQueue.TryTake(item, std::chrono::seconds(1)))
            processItem(item);
    }
}
It also means that if an item is added at any point during the blocking part, it's acted on immediately instead of up to a full second later.
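For comparison, here is a minimal sketch of the same timed-wait consumer in Java rather than C# or C++. The language switch, the TimedPollLogger class, and its method names are assumptions made purely for illustration. It relies on BlockingQueue.poll with a timeout, so the worker waits inside the queue while it is empty and wakes as soon as an item arrives. In .NET itself, BlockingCollection<T> already provides TryTake overloads that accept a timeout, which play the same role.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical logger; names are illustrative, not taken from the question.
final class TimedPollLogger {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final List<String> written = new CopyOnWriteArrayList<>();
    private volatile boolean enabled = true;
    private final Thread worker = new Thread(this::drain, "log-worker");

    TimedPollLogger() {
        worker.setDaemon(true);
        worker.start();
    }

    void log(String message) {
        queue.add(message);
    }

    // Waits at most one second per iteration: no busy loop while the queue
    // is empty, yet the thread still notices when `enabled` is cleared.
    private void drain() {
        try {
            while (enabled || !queue.isEmpty()) {
                String item = queue.poll(1, TimeUnit.SECONDS);
                if (item != null) {
                    written.add(item); // stand-in for writing to a file
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Stops the worker after it has drained the queue and returns what it wrote.
    List<String> stop() {
        enabled = false;
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return written;
    }
}
```

Calling log("a") and log("b") followed by stop() should return both messages in order; stop() itself may take up to one second because the worker can be in the middle of a timed poll.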


V8 Garbage Collection Differs For ObjectTemplates and Objects Created With Them

V8's garbage collection seems to clean up a Local<T> value easily as it goes, where T is anything stored in the Local. However, if you create an ObjectTemplate and then create an instance of that object, V8 will wait to clean up the memory. Consider the following example, where the resident set size remains stable throughout program execution:
Isolate* isolate = Isolate::New(create_params);
Persistent<Context> *context= ContextNew(isolate); // creates a persistent context
for(int i = 1 ; i <= 1000000; i ++ ) {
isolate->Enter();
EnterContext(isolate, context); // enters the context
{
HandleScope handle_scope(isolate);
Local<Object> result = Object::New(isolate);
}
ExitContext(isolate, context);
isolate->Exit();
}
Above, all we do is create a new Object in a loop; then handle_scope goes out of scope, and it looks like the allocated Local values are garbage collected right away, as the resident set size remains steady. However, there is an issue when this object is created through an ObjectTemplate that is also created in the loop:
Isolate* isolate = Isolate::New(create_params);
Persistent<Context> *context= ContextNew(isolate); // creates a persistent context
for(int i = 1 ; i <= 1000000; i ++ ) {
isolate->Enter();
EnterContext(isolate, context); // enters the context
{
HandleScope handle_scope(isolate);
Local<Object> result;
Local<ObjectTemplate> templ = ObjectTemplate::New(isolate);
if (!templ->NewInstance(context->Get(isolate)).ToLocal(&result)) { exit(1); }
}
ExitContext(isolate, context);
isolate->Exit();
}
Here, the resident set size increases linearly until an unnecessary amount of RAM is used for such a small program. I'm just looking to understand what is happening here. Sorry for the long explanation; I tried to keep it short and to the point. Thanks in advance!
V8 assumes that ObjectTemplates are long-lived and hence allocates them in the "old generation" part of the heap, where it takes longer for them to get collected by a (comparatively slow and rare) full GC cycle -- if the assumption was right and they actually are long-lived, this is an overall performance win. Objects themselves, on the other hand, are allocated in the "young generation", where they are quick and easy to collect by the (comparatively frequent) young-generation GC cycles.
If you run with --trace-gc you should see this explanation confirmed.

Why filter with side effects performs better than a Spliterator based implementation?

Regarding the question "How to skip even lines of a Stream obtained from the Files.lines", I followed the accepted answer's approach and implemented my own filterEven() method based on the Spliterator<T> interface, e.g.:
public static <T> Stream<T> filterEven(Stream<T> src) {
    Spliterator<T> iter = src.spliterator();
    AbstractSpliterator<T> res = new AbstractSpliterator<T>(Long.MAX_VALUE, Spliterator.ORDERED)
    {
        @Override
        public boolean tryAdvance(Consumer<? super T> action) {
            iter.tryAdvance(item -> {}); // discard
            return iter.tryAdvance(action); // use
        }
    };
    return StreamSupport.stream(res, false);
}
which I can use in the following way:
Stream<DomainObject> res = filterEven(Files.lines(src))
    .map(line -> toDomainObject(line));
However, measuring the performance of this approach against the following one, which uses a filter() with side effects, I noticed that the latter performs better:
final int[] counter = {0};
final Predicate<String> isEvenLine = item -> ++counter[0] % 2 == 0;
Stream<DomainObject> res = Files.lines(src)
    .filter(isEvenLine)
    .map(line -> toDomainObject(line));
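As a quick sanity check that the two approaches select the same lines, here is a small self-contained sketch. The EvenLines class, the filterEvenStateful name, and the sample data are assumptions for illustration; filterEven follows the method shown in the question.

```java
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators.AbstractSpliterator;
import java.util.function.Consumer;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

final class EvenLines {
    // Spliterator-based variant, following the question's filterEven().
    static <T> Stream<T> filterEven(Stream<T> src) {
        Spliterator<T> iter = src.spliterator();
        AbstractSpliterator<T> res =
            new AbstractSpliterator<T>(Long.MAX_VALUE, Spliterator.ORDERED) {
                @Override
                public boolean tryAdvance(Consumer<? super T> action) {
                    iter.tryAdvance(item -> {});    // discard 1st, 3rd, ...
                    return iter.tryAdvance(action); // keep 2nd, 4th, ...
                }
            };
        return StreamSupport.stream(res, false);
    }

    // Stateful-filter variant; the counter makes it safe only for
    // sequential streams.
    static <T> Stream<T> filterEvenStateful(Stream<T> src) {
        final int[] counter = {0};
        Predicate<T> isEvenLine = item -> ++counter[0] % 2 == 0;
        return src.filter(isEvenLine);
    }

    public static void main(String[] args) {
        List<String> lines = List.of("l1", "l2", "l3", "l4", "l5");
        // Both calls print [l2, l4]
        System.out.println(filterEven(lines.stream()).collect(Collectors.toList()));
        System.out.println(filterEvenStateful(lines.stream()).collect(Collectors.toList()));
    }
}
```

Both variants give the same result only because the stream is sequential and ordered; the stateful predicate would break on a parallel stream.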
I tested the performance with JMH, and I am not including the file load in the benchmark; I load the file into an array beforehand. Each benchmark starts by creating a Stream<String> from that array, then filters even lines, then applies mapToInt() to extract the value of an int field, and finally applies max(). Here is one of the benchmarks (you can check the whole program here, and here you have the data file with about 186 lines):
@Benchmark
public int maxTempFilterEven(DataSource src){
    Stream<String> content = Arrays.stream(src.data)
        .filter(s-> s.charAt(0) != '#') // Filter comments
        .skip(1); // Skip line: Not available
    return filterEven(content) // Filter daily info and skip hourly
        .mapToInt(line -> parseInt(line.substring(14, 16)))
        .max()
        .getAsInt();
}
I don't understand why the filter() approach performs better (~80 ops/ms) than filterEven() (~50 ops/ms).
Intro
I think I know the reason, but unfortunately I have no idea how to improve the performance of the Spliterator-based solution (at least not without rewriting the whole Streams API).
Sidenote 1: performance was not the most important design goal when the Stream API was designed. If performance is critical, rewriting the code without the Stream API will most probably make it faster. (For example, the Stream API unavoidably increases memory allocation and thus GC pressure.) On the other hand, in most scenarios the Stream API provides a nicer, higher-level API at the cost of a relatively small performance degradation.
Part 1 or Short theoretical answer
Stream is designed to implement a kind of internal iteration as the main means of consumption; external iteration (i.e. Spliterator-based) is an additional means that is kind of "emulated". Thus external iteration involves some overhead. Laziness adds some limits to the efficiency of external iteration, and the need to support flatMap makes it necessary to use some kind of dynamic buffer in this process.
Sidenote 2: In some cases Spliterator-based iteration might be as fast as internal iteration (i.e. filter in this case). In particular, this is so when you create a Spliterator directly from the data-containing Stream. To see it, you can modify your tests to materialize your first filter into a String array:
String[] filteredData = Arrays.stream(src.data)
.filter(s-> s.charAt(0) != '#') // Filter comments
.skip(1)
.toArray(String[]::new);
and then compare the performance of maxTempFilter and maxTempFilterEven modified to accept that pre-filtered String[] filteredData. If you want to know why this is so, you should probably read the rest of this long answer, or at least Part 2.
Part 2 or Longer theoretical answer:
Streams were designed to be consumed mainly as a whole by some terminal operation. Iterating over elements one by one, although supported, is not designed as the main way to consume streams.
Note that using the "functional" Stream API operations such as map, flatMap, filter, reduce, and collect, you can't say at some step "I have had enough data, stop iterating over the source and pushing values". You can discard some incoming data (as filter does) but you can't stop the iteration. (The limit and skip transformations are actually implemented using Spliterator inside; and anyMatch, allMatch, noneMatch, findFirst, findAny, etc. use the non-public API j.u.s.Sink.cancellationRequested. They are also easier to handle because there can't be several terminal operations.) If all transformations in the pipeline are synchronous, you can combine them into a single aggregated function (Consumer) and call it in a simple loop (optionally splitting the loop execution over several threads). This is what my simplified version of the state-based filter represents (see the code in the Show me some code section). It gets a bit more complicated if there is a flatMap in the pipeline, but the idea is still the same.
A Spliterator-based transformation is fundamentally different because it adds an asynchronous consumer-driven step to the pipeline. Now the Spliterator, rather than the source Stream, drives the iteration process. If you ask for a Spliterator directly on the source Stream, it might be able to return an implementation that just iterates over its internal data structure, and this is why materializing the pre-filtered data should remove the performance difference. However, if you create a Spliterator for some non-empty pipeline, there is no (simple) choice other than asking the source to push elements one by one through the pipeline until some element passes all the filters (see also the second example in the Show me some code section). The fact that source elements are pushed one by one rather than in batches is a consequence of the fundamental decision to make Streams lazy. The need for a buffer instead of just one element is a consequence of supporting flatMap: pushing one element from the source can produce many elements for the Spliterator.
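The one-by-one pushing described above can be observed directly. The following is a small sketch (the PipelineSpliteratorTrace class and the log messages are assumptions for illustration): it asks a non-empty pipeline for its Spliterator and records what happens during a single tryAdvance call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Spliterator;
import java.util.stream.Stream;

final class PipelineSpliteratorTrace {
    // Records which source elements are pushed through the pipeline while
    // the caller asks the pipeline's Spliterator for exactly one element.
    static List<String> traceOneAdvance() {
        List<String> log = new ArrayList<>();
        Spliterator<Integer> sp = Stream.of(1, 2, 3, 4, 5, 6)
            .peek(x -> log.add("source pushed " + x))
            .filter(x -> x % 2 == 0)
            .spliterator(); // lazy: nothing is pulled from the source yet
        log.add("spliterator created");
        sp.tryAdvance(x -> log.add("consumer got " + x));
        return log;
    }

    public static void main(String[] args) {
        // Prints: [spliterator created, source pushed 1, source pushed 2, consumer got 2]
        System.out.println(traceOneAdvance());
    }
}
```

The source is asked for elements only until one of them survives the filter, and elements 3 to 6 are never touched by this single call.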
Part 3 or Show me some code
This part tries to provide some backing with the code (both links to the real code and simulated code) of what was described in the "theoretical" parts.
First of all, you should know that the current Streams API implementation accumulates non-terminal (intermediate) operations into a single lazy pipeline (see j.u.s.AbstractPipeline and its children such as j.u.s.ReferencePipeline). Then, when the terminal operation is applied, all the elements from the original Stream are "pushed" through the pipeline.
What you see is the result of two things:
1. the fact that stream pipelines are different when you have a Spliterator-based step inside;
2. the fact that your OddLines is not the first step in the pipeline.
The code with a stateful filter is more or less similar to the following straightforward code:
static int similarToFilter(String[] data)
{
final int[] counter = {0};
final Predicate<String> isEvenLine = item -> ++counter[0] % 2 == 0;
int skip = 1;
boolean reduceEmpty = true;
int reduceState = 0;
for (String outerEl : data)
{
if (outerEl.charAt(0) != '#')
{
if (skip > 0)
skip--;
else
{
if (isEvenLine.test(outerEl))
{
int intEl = parseInt(outerEl.substring(14, 16));
if (reduceEmpty)
{
reduceState = intEl;
reduceEmpty = false;
}
else
{
reduceState = Math.max(reduceState, intEl);
}
}
}
}
}
return reduceState;
}
Note that this is effectively a single loop with some calculations (filtering/transformations) inside.
When you add a Spliterator into the pipeline, on the other hand, things change significantly. Even with simplifications, code that is reasonably similar to what actually happens becomes much larger, such as:
interface Sp<T>
{
public boolean tryAdvance(Consumer<? super T> action);
}
static class ArraySp<T> implements Sp<T>
{
private final T[] array;
private int pos;
public ArraySp(T[] array)
{
this.array = array;
}
@Override
public boolean tryAdvance(Consumer<? super T> action)
{
if (pos < array.length)
{
action.accept(array[pos]);
pos++;
return true;
}
else
{
return false;
}
}
}
static class WrappingSp<T> implements Sp<T>, Consumer<T>
{
private final Sp<T> sourceSp;
private final Predicate<T> filter;
private final ArrayList<T> buffer = new ArrayList<T>();
private int pos;
public WrappingSp(Sp<T> sourceSp, Predicate<T> filter)
{
this.sourceSp = sourceSp;
this.filter = filter;
}
@Override
public void accept(T t)
{
buffer.add(t);
}
@Override
public boolean tryAdvance(Consumer<? super T> action)
{
while (true)
{
if (pos >= buffer.size())
{
pos = 0;
buffer.clear();
sourceSp.tryAdvance(this);
}
// failed to fill buffer
if (buffer.size() == 0)
return false;
T nextElem = buffer.get(pos);
pos++;
if (filter.test(nextElem))
{
action.accept(nextElem);
return true;
}
}
}
}
static class OddLineSp<T> implements Sp<T>, Consumer<T>
{
private Sp<T> sourceSp;
public OddLineSp(Sp<T> sourceSp)
{
this.sourceSp = sourceSp;
}
@Override
public boolean tryAdvance(Consumer<? super T> action)
{
if (sourceSp == null)
return false;
sourceSp.tryAdvance(this);
if (!sourceSp.tryAdvance(action))
{
sourceSp = null;
}
return true;
}
@Override
public void accept(T t)
{
}
}
static class ReduceIntMax
{
boolean reduceEmpty = true;
int reduceState = 0;
public int getReduceState()
{
return reduceState;
}
public void accept(int t)
{
if (reduceEmpty)
{
reduceEmpty = false;
reduceState = t;
}
else
{
reduceState = Math.max(reduceState, t);
}
}
}
static int similarToSpliterator(String[] data)
{
ArraySp<String> src = new ArraySp<>(data);
int[] skip = new int[1];
skip[0] = 1;
WrappingSp<String> firstFilter = new WrappingSp<String>(src, (s) ->
{
if (s.charAt(0) == '#')
return false;
if (skip[0] != 0)
{
skip[0]--;
return false;
}
return true;
});
OddLineSp<String> oddLines = new OddLineSp<>(firstFilter);
final ReduceIntMax reduceIntMax = new ReduceIntMax();
while (oddLines.tryAdvance(s ->
{
int intValue = parseInt(s.substring(14, 16));
reduceIntMax.accept(intValue);
})) ; // do nothing in the loop body
return reduceIntMax.getReduceState();
}
This code is larger because the logic is impossible (or at least very hard) to represent without some non-trivial stateful callbacks inside the loop. Here interface Sp is a mix of j.u.s.Stream and j.u.Spliterator interfaces.
Class ArraySp represents a result of Arrays.stream.
Class WrappingSp is similar to j.u.s.StreamSpliterators.WrappingSpliterator, which in the real code represents an implementation of the Spliterator interface for any non-empty pipeline, i.e. a Stream with at least one intermediate operation applied to it (see the j.u.s.AbstractPipeline.spliterator method). In my code I merged it with a StatelessOp subclass and put the logic responsible for the filter method implementation there. Also, for simplicity, I implemented skip using filter.
OddLineSp corresponds to your OddLines and its resulting Stream.
ReduceIntMax represents the ReduceOps terminal operation for Math.max over int.
So what's important in this example? The important thing is that, since you first filter your original stream, your OddLineSp is created from a non-empty pipeline, i.e. from a WrappingSp. And if you take a closer look at WrappingSp, you'll notice that every time tryAdvance is called, it delegates the call to the sourceSp and accumulates the result(s) into a buffer. Moreover, since you have no flatMap in the pipeline, elements will be copied into the buffer one by one. I.e. every time WrappingSp.tryAdvance is called, it will call ArraySp.tryAdvance, get back exactly one element (via the callback), and pass it on to the consumer provided by the caller (unless the element doesn't match the filter, in which case ArraySp.tryAdvance will be called again and again, but the buffer is still never filled with more than one element at a time).
Sidenote 3: If you want to look at the real code, the most interesting places are j.u.s.StreamSpliterators.WrappingSpliterator.tryAdvance, which calls j.u.s.StreamSpliterators.AbstractWrappingSpliterator.doAdvance, which in turn calls j.u.s.StreamSpliterators.AbstractWrappingSpliterator.fillBuffer, which in turn calls pusher, which is initialized at j.u.s.StreamSpliterators.WrappingSpliterator.initPartialTraversalState.
So the main thing that's hurting performance is this copying into the buffer.
Unfortunately for us ordinary Java developers, the current implementation of the Stream API is pretty much closed, and you can't modify only some aspects of its internal behavior using inheritance or composition.
You may use some reflection-based hacking to make copying to the buffer more efficient for your specific case and gain some performance (at the cost of the Stream's laziness), but you can't avoid this copying altogether, and thus Spliterator-based code will be slower anyway.
Going back to the example from the Sidenote #2, Spliterator-based test with materialized filteredData works faster because there is no WrappingSp in the pipeline before OddLineSp and thus there will be no copying into an intermediate buffer.

Fetch 1M records in orientdb: why is it 6x slower than bare SQL+MySQL

For some graph algorithm I need to fetch a lot of records from a database into memory (~1M records). I want this to be done fast and I want the records to be objects (that is: I want an ORM). To crudely benchmark different solutions I created a simple problem of one table with 1M Foo objects, like I did here: Why is loading SQLAlchemy objects via the ORM 5-8x slower than rows via a raw MySQLdb cursor?
One can see that fetching them using bare SQL is extremely fast; converting the records to objects using a simple for-loop is also fast. Both execute in around 2-3 seconds. However, using ORMs like SQLAlchemy and Hibernate, this takes 20-30 seconds: a lot slower if you ask me, and this is just a simple example without relations and joins.
SQLAlchemy advertises a "Mature, High Performing Architecture" (http://www.sqlalchemy.org/features.html). Similarly, Hibernate advertises "High Performance" (http://hibernate.org/orm/). In a way both are right, because they allow very generic object-oriented data models to be mapped back and forth to a MySQL database. On the other hand they are awfully wrong, since they are 10x slower than just SQL and native code. Personally I think they could do better benchmarks to show this, that is, a benchmark comparing with native SQL + Java or Python. But that is not the problem at hand.
Of course, I don't want SQL + native code, as it is hard to maintain. So I was wondering why there isn't something like an object-oriented database that handles the database->object mapping natively. Someone suggested OrientDB, hence I tried it. The API is quite nice: when you have your getters and setters right, the object is insertable and selectable.
But I want more than just API-sweetness, so I tried the 1M example:
import java.io.Serializable;
public class Foo implements Serializable {
public Foo() {}
public Foo(int a, int b, int c) { this.a=a; this.b=b; this.c=c; }
public int a,b,c;
public int getA() { return a; }
public void setA(int a) { this.a=a; }
public int getB() { return b; }
public void setB(int b) { this.b=b; }
public int getC() { return c; }
public void setC(int c) { this.c=c; }
}
import com.orientechnologies.orient.object.db.OObjectDatabaseTx;
public class Main {
public static void insert() throws Exception {
OObjectDatabaseTx db = new OObjectDatabaseTx ("plocal:/opt/orientdb-community-1.7.6/databases/test").open("admin", "admin");
db.getEntityManager().registerEntityClass(Foo.class);
int N=1000000;
long time = System.currentTimeMillis();
for(int i=0; i<N; i++) {
Foo foo = new Foo(i, i*i, i+i*i);
db.save(foo);
}
db.close();
System.out.println(System.currentTimeMillis() - time);
}
public static void fetch() {
OObjectDatabaseTx db = new OObjectDatabaseTx ("plocal:/opt/orientdb-community-1.7.6/databases/test").open("admin", "admin");
db.getEntityManager().registerEntityClass(Foo.class);
long time = System.currentTimeMillis();
for (Foo f : db.browseClass(Foo.class).setFetchPlan("*:-1")) {
if(f.getA() == 345234) System.out.println(f.getB());
}
System.out.println("Fetching all Foo records took: " + (System.currentTimeMillis() - time) + " ms");
db.close();
}
public static void main(String[] args) throws Exception {
//insert();
fetch();
}
}
Fetching 1M Foo objects using OrientDB takes approximately 18 seconds. The for-loop with the getA() call is there to force the object fields to actually be loaded into memory, as I noticed that by default they are fetched lazily. I guess this may also be the reason fetching the Foo objects is slow: it accesses the database on each iteration instead of once when it fetches everything (including the fields).
I tried to fix that using setFetchPlan("*:-1"); I figured it might also apply to fields, but that did not seem to work.
Question: Is there a way to do this fast, preferably in the 2-3 seconds range? Why does this take 18 seconds, whilst the bare SQL version takes 3 seconds?
Addition: Using an ODatabaseDocumentTx as @frens-jan-rumph suggested did not give me a speedup of approximately 5x, but of approximately 2x. Adjusting the code as follows gave me a running time of approximately 9 seconds. This is still 3 times slower than raw SQL, even though no conversion to Foo objects was executed. Almost all of the time goes to the for-loop.
public static void fetch() {
ODatabaseDocumentTx db = new ODatabaseDocumentTx ("plocal:/opt/orientdb-community-1.7.6/databases/pits2").open("admin", "admin");
long time = System.currentTimeMillis();
ORecordIteratorClass<ODocument> it = db.browseClass("Foo");
it.setFetchPlan("*:0");
System.out.println("Fetching all Foo records took: " + (System.currentTimeMillis() - time) + " ms");
time = System.currentTimeMillis();
for (ODocument f : it) {
//if((int)f.field("a") == 345234) System.out.println(f.field("b"));
}
System.out.println("Iterating all Foo records took: " + (System.currentTimeMillis() - time) + " ms");
db.close();
}
The answer lies in convenience.
During an interview, when I asked a candidate what they thought of LINQ (C#, I know, but pertinent to your question), they quite rightly answered that it was a sacrifice of performance for convenience.
A hand-written SQL statement (whether or not it calls a stored procedure) is always going to be faster than using an ORM that auto-magically converts the results of the query into nice, easy-to-use POCOs.
That said, the difference should not be as great as what you have experienced. Yes, there is overhead in doing it the auto-magical way, but it shouldn't be that large. I do have experience here, and within C# I have had to use special reflection classes to reduce the time it takes to do this auto-magical mapping.
With large swathes of data, I would expect an initial slow-down from an ORM, but after that it should be negligible. 3 seconds to 18 seconds is huge.
If you profile your test, you will discover that around 60-80% of the CPU time is taken by the execution of the following four methods:
com.orientechnologies...OObjectEntitySerializer.getField(...)
com.orientechnologies...OObjectEntityEnhancer.getProxiedInstance(...)
com.orientechnologies...OObjectMethodFilter.isScalaClass(...)
javassist...SecurityActions.getDeclaredMethods(...)
So yes, in this setup the bottleneck is in the ORM layer. Using ODatabaseDocumentTx provides a speedup of around 5x, which might just get you where you want to be.
Still, a lot of time (close to 50%) is spent in com.orientechnologies...OJNADirectMemory.getInt(...). That's expensive for just reading an integer from a memory location. I don't understand why plain Java NIO ByteBuffers aren't used here; that would save a lot of crossing of the Java/native border, etc.
Apart from these micro-benchmarks and the remarkable behaviour of OrientDB, I think that there are at least two other things to consider:
1. Does this test reflect your expected workload?
I.e. you read a straightforward list of records. If so, why use a database? If not, then test on the actual workload, e.g. your searches, graph traversals, etc.
2. Does this test reflect your expected setup?
E.g. you are reading from a plocal database, while reading from any database over TCP/IP might just as well have its bottleneck somewhere else. Also, you are reading from one thread/process; if you expect concurrent use of the database, this probably throws things off considerably (disk seeks, more bookkeeping overhead, etc.)
P.S. I would recommend warming up the code before benchmarking.
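As a rough illustration of what warming up could look like, here is a minimal hand-rolled sketch. The WarmupTimer class, the meanNanos method, and its parameters are hypothetical names; a harness such as JMH handles warm-up and dead-code elimination far more carefully. The idea is simply to run the workload a number of times without timing it, so the JIT has a chance to compile the hot paths, and only then measure.

```java
import java.util.function.IntSupplier;

final class WarmupTimer {
    // Results are published here so the JIT cannot treat the work as dead code.
    static volatile int sink;

    // Runs `workload` `warmups` times untimed, then returns the mean
    // wall-clock time in nanoseconds over `measured` timed runs.
    static long meanNanos(IntSupplier workload, int warmups, int measured) {
        int acc = 0;
        for (int i = 0; i < warmups; i++) {
            acc += workload.getAsInt();
        }
        long start = System.nanoTime();
        for (int i = 0; i < measured; i++) {
            acc += workload.getAsInt();
        }
        long elapsed = System.nanoTime() - start;
        sink = acc;
        return elapsed / measured;
    }

    public static void main(String[] args) {
        // Stand-in workload; in the question this would be the fetch loop.
        IntSupplier work = () -> {
            int sum = 0;
            for (int i = 0; i < 1_000_000; i++) sum += i % 7;
            return sum;
        };
        System.out.println("mean ns per run: " + meanNanos(work, 20, 20));
    }
}
```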
What you do here is a worst-case scenario. As you wrote (or should have written), your test just reads a table from your database and writes it directly to a stream of whatever.
So what you see is the complete overhead of a lot of magic. Usually, if you do something more complex like joining, selecting, filtering and ordering, the overhead of your ORM comes down to a more reasonable share of 5 to 10%.
Another thing you should think about (I guess Orient is doing the same): the ORM solution creates new objects, multiplying memory consumption, and Java is really bad at memory consumption. That is the reason why I use custom in-memory tables whenever I handle a lot of data/objects, you know, where an object is a row in a table.
Another thing: your objects also get inserted into a list/map (at least Hibernate does this), which tracks the dirtiness of the objects once you change them. This insertion also takes a lot of time when the collection is resized, which is a reason why we use paginated lists or maps. Copying 1M references is dead slow if the backing array grows.

Timing of parallel actions using the Task Parallel Library in C#

I am running some experiments, timing them and comparing the times to find the best "algorithm". The question that came up was whether running the tasks in parallel would distort the relative running times of the experiments, and whether I would get more representative results by running them sequentially. Here is a (simplified) version of the code:
public static void RunExperient(IEnumerable<Action> experiments)
{
    Parallel.ForEach(experiments, experiment =>
    {
        var sw = Stopwatch.StartNew(); //line 1
        experiment(); //line 2
        sw.Stop(); //line 3
        Console.WriteLine(@"Time was {0}", sw.ElapsedMilliseconds);
    });
}
My questions are about what is happening "behind the scenes":
When a task has started, is it possible that the OS or the framework suspends the task during its execution and continues it later, making the measured running time of the experiment all wrong?
Would I get more representative results by running the experiments sequentially?
That depends on the machine that you are running on and what the experiments do, but generally the answer is yes, they may affect one another, mainly through resource starvation. Here's an example:
public class Piggy {
    public void GreedyExperiment() {
        // Raise the priority of the thread that runs this experiment
        Thread.CurrentThread.Priority = ThreadPriority.Highest;
        for (var i=0;i<1000000000;i++) {
            var j = Math.Sqrt(i / 5);
        }
    }
}
That's going to do a tight loop on a high priority thread, which will basically consume one processor until it is done. If you only have one processor in the machine and TPL decides to schedule two experiments on it, the other one is going to be starved for CPU time.
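As a rough sketch of the sequential alternative asked about in the question, written in Java rather than C# (SequentialTiming and timeSequentially are hypothetical names), each experiment can be run and timed on one thread, one after another, so the experiments at least do not compete with each other for cores:

```java
import java.util.List;

final class SequentialTiming {
    // Runs each experiment to completion before starting the next one and
    // returns the elapsed wall-clock time of each, in milliseconds.
    static long[] timeSequentially(List<Runnable> experiments) {
        long[] millis = new long[experiments.size()];
        for (int i = 0; i < experiments.size(); i++) {
            long start = System.nanoTime();
            experiments.get(i).run();
            millis[i] = (System.nanoTime() - start) / 1_000_000;
        }
        return millis;
    }
}
```

Other processes on the machine can still preempt the measuring thread, so repeated runs are still advisable.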

Efficient Independent Synchronized Blocks?

I have a scenario where, at certain points in my program, a thread needs to update several shared data structures. Each data structure can be safely updated in parallel with any other data structure, but each data structure can only be updated by one thread at a time. The simple, naive way I've expressed this in my code is:
synchronized updateStructure1();
synchronized updateStructure2();
// ...
This seems inefficient because if multiple threads are trying to update structure 1, but no thread is trying to update structure 2, they'll all block waiting for the lock that protects structure 1, while the lock for structure 2 sits untaken.
Is there a "standard" way of remedying this? In other words, is there a standard threading primitive that tries to update all structures in a round-robin fashion, blocks only if all locks are taken, and returns when all structures are updated?
This is a somewhat language agnostic question, but in case it helps, the language I'm using is D.
If your language supported lightweight threads or Actors, you could always have the updating thread spawn a new thread to change each object, where each thread just locks, modifies, and unlocks its object. Then have your updating thread join on all its child threads before returning. This punts the problem to the runtime's scheduler, which is free to schedule those child threads any way it can for best performance.
You could do this in languages with heavier threads, but the spawn and join might have too much overhead (though thread pooling might mitigate some of this).
I don't know if there's a standard way to do this. However, I would implement this something like the following:
do
{
if (!updatedA && mutexA.tryLock())
{
scope(exit) mutexA.unlock();
updateA();
updatedA = true;
}
if (!updatedB && mutexB.tryLock())
{
scope(exit) mutexB.unlock();
updateB();
updatedB = true;
}
}
while (!(updatedA && updatedB));
Some clever metaprogramming could probably cut down the repetition, but I leave that as an exercise for you.
Sorry if I'm being naive, but do you not just synchronize on separate objects to make the concerns independent?
e.g.
public final Object lock1 = new Object(); // access to resource 1
public final Object lock2 = new Object(); // access to resource 2
void updateStructure1() {
    synchronized (lock1) {
        ...
    }
}
void updateStructure2() {
    synchronized (lock2) {
        ...
    }
}
To my knowledge, there is not a standard way to accomplish this, and you'll have to get your hands dirty.
To paraphrase your requirements, you have a set of data structures, and you need to do work on them, but not in any particular order. You only want to block waiting on a data structure if all other objects are blocked. Here's the pseudocode I would base my solution on:
work = unshared list of objects that need updating
while work is not empty:
found = false
for each obj in work:
try locking obj
if successful:
remove obj from work
found = true
obj.update()
unlock obj
if !found:
// Everything is locked, so we have to wait
obj = randomly pick an object from work
remove obj from work
lock obj
obj.update()
unlock obj
An updating thread will only block if it finds that all objects it needs to use are locked. Then it must wait on something, so it just picks one and locks it. Ideally, it would pick the object that will be unlocked earliest, but there's no simple way of telling that.
Also, it's conceivable that an object might become free while the updater is in the try loop and so the updater would skip it. But if the amount of work you're doing is large enough, relative to the cost of iterating through that loop, the false conflict should be rare, and it would only matter in cases of extremely high contention.
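The pseudocode above could be sketched in Java (not D, which the question uses) roughly as follows. TryLockUpdater, Guarded, and the stress helper are hypothetical names introduced for illustration, and the shared "structure" is reduced to a counter guarded by its own ReentrantLock.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.locks.ReentrantLock;

final class TryLockUpdater {
    // Hypothetical shared structure: a counter guarded by its own lock.
    static final class Guarded {
        final ReentrantLock lock = new ReentrantLock();
        int value; // only touched while `lock` is held
        void update() { value++; }
    }

    // Updates every structure exactly once; blocks only when all of the
    // remaining structures are currently locked by other threads.
    static void updateAll(List<Guarded> structures) {
        List<Guarded> work = new ArrayList<>(structures); // unshared work list
        while (!work.isEmpty()) {
            boolean found = false;
            for (Iterator<Guarded> it = work.iterator(); it.hasNext();) {
                Guarded g = it.next();
                if (g.lock.tryLock()) {
                    try { g.update(); } finally { g.lock.unlock(); }
                    it.remove();
                    found = true;
                }
            }
            if (!found) {
                // Everything is locked, so wait on a randomly chosen one.
                Guarded g = work.remove(ThreadLocalRandom.current().nextInt(work.size()));
                g.lock.lock();
                try { g.update(); } finally { g.lock.unlock(); }
            }
        }
    }

    // Demo: `threads` threads each call updateAll `rounds` times on three
    // shared structures; every counter should end at threads * rounds.
    static int[] stress(int threads, int rounds) {
        List<Guarded> shared = List.of(new Guarded(), new Guarded(), new Guarded());
        List<Thread> pool = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            Thread th = new Thread(() -> {
                for (int r = 0; r < rounds; r++) updateAll(shared);
            });
            pool.add(th);
            th.start();
        }
        for (Thread th : pool) {
            try {
                th.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return shared.stream().mapToInt(g -> g.value).toArray();
    }
}
```

Because each structure has its own lock and every thread releases a lock before trying the next one, no thread ever holds two locks at once, so this scheme cannot deadlock.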
I don't know any "standard" way of doing this, sorry. So what follows is just a ThreadGroup, abstracted by a Swarm class, that "hacks" at a job list until all jobs are done, round-robin style, and makes sure that as many threads as possible are used. I don't know how to do this without a job list.
Disclaimer: I'm very new to D and to concurrent programming, so the code is rather amateurish. I saw this more as a fun exercise. (I too am dealing with some concurrency stuff.) I also understand that this isn't quite what you're looking for. If anyone has any pointers, I'd love to hear them!
import core.thread,
core.sync.mutex,
core.stdc.stdio,
std.stdio;
class Swarm{
ThreadGroup group;
Mutex mutex;
auto numThreads = 1;
void delegate ()[int] jobs;
this(void delegate()[int] aJobs, int aNumThreads){
jobs = aJobs;
numThreads = aNumThreads;
group = new ThreadGroup;
mutex = new Mutex();
}
void runBlocking(){
run();
group.joinAll();
}
void run(){
foreach(c;0..numThreads)
group.create( &swarmJobs );
}
void swarmJobs(){
void delegate () myJob;
do{
myJob = null;
synchronized(mutex){
if(jobs.length > 0)
foreach(i,job;jobs){
myJob = job;
jobs.remove(i);
break;
}
}
if(myJob)
myJob();
}while(myJob);
}
}
class Jobs{
void job1(){
foreach(c;0..1000){
foreach(j;0..2_000_000){}
writef("1");
fflush(core.stdc.stdio.stdout);
}
}
void job2(){
foreach(c;0..1000){
foreach(j;0..1_000_000){}
writef("2");
fflush(core.stdc.stdio.stdout);
}
}
}
void main(){
auto jobs = new Jobs();
void delegate ()[int] jobsList =
[1:&jobs.job1,2:&jobs.job2,3:&jobs.job1,4:&jobs.job2];
int numThreads = 2;
auto swarm = new Swarm(jobsList,numThreads);
swarm.runBlocking();
writefln("end");
}
There's no standard solution but rather a class of standard solutions depending on your needs.
http://en.wikipedia.org/wiki/Scheduling_algorithm
