Timing of parallel actions using the Task Parallel Library in C# - performance

I am running some experiments, timing them and comparing the times to find the best "algorithm". The question that came up was if running the tasks in parallel would make the relative runningtimes of the experiments wrong and if I would get more representative results by running them sequentially. Here is a (simplified) version of the code:
public static void RunExperient(IEnumerable<Action> experiments)
{
Parallel.ForEach(experiments, experiment =>
{
var sw = Stopwatch.StartNew(); //line 1
experiment(); //line 2
sw.Stop(); //line 3
Console.WriteLine(#"Time was {0}", sw.ElapsedMilliseconds);
});
}
My questions are about what is happening "behind the scenes":
When a task has started, is it possible that the OS or the framework can suspend the task during its execution and continue on later making the running time of the experiment all wrong?
Would I get more representative results by running the experiments sequentially?

That depends on the machine that you are running on and what the experiments do, but generally the answer is yes, they may affect one another. Mainly through resource starvation. Here's an example:
public class Piggy {
public void GreedyExperiment() {
Thread.Priority = ThreadPriority.Highest;
for (var i=0;i<1000000000;i++) {
var j = Math.Sqrt(i / 5);
}
}
}
That's going to do a tight loop on a high priority thread, which will basically consume one processor until it is done. If you only have one processor in the machine and TPL decides to schedule two experiments on it, the other one is going to be starved for CPU time.

Related

How to run load test Scenario once every hour?

I have a loadTest with several scenarios, running for 12 hours.
I want to add another scenario, that will run once a hour, by 10 virtual users.
The ugly solution I'm using is to have 12 additional scenarios, each one has its own "delayed start", with 1 hour interval.
This is an ugly solution.
How can I tell a specific scenario to run once a hour.
Note: For this case I don't need it to run sharp each hour. The main idea is to have a task that run +/- every hour.
I suggest having a load test with two scenarios, one for the main user load, the other for the hourly 10-user case. Then arrange that the number of virtual users (VUs) for the 10-user is set to 10 at the start of every hour and reduced as appropriate. The question does not state how long the 10-user tests runs each hour.
The basic way of achieving this is to modify m_loadTest.Scenarios[N].CurrentLoad, for a suitable N, in a load test heartbeat plugin. The heartbeat is called, as it name suggests, frequently during the test. So arrange that it checks the run time of the test and at the start of each hour assign m_loadTest.Scenarios[N].CurrentLoad = 10 and a short time later set it back to 0 (i.e. zero). I believe that setting the value to a smaller value than its previous value allows the individual test executions by a VU to run to their natural end but the VUs will not start new tests that would exceed the value.
The plugin code then could look similar to the following (untested):
public class TenUserLoadtPlugin : ILoadTestPlugin
{
const int durationOf10UserTestInSeconds = ...; // Not specified in question.
const int scenarioNumber = ...; // Experiment to determine this.
public void Initialize(LoadTest loadTest)
{
m_loadTest = loadTest;
// Register to listen for the heartbeat event
loadTest.Heartbeat += new EventHandler<HeartbeatEventArgs>(loadTest_Heartbeat);
}
void loadTest_Heartbeat(object sender, HeartbeatEventArgs e)
{
int secondsWithinCurrentHour = e.ElapsedSeconds % (60*60);
int loadWanted = secondsWithinCurrentHour > durationOf10UserTestInSeconds ? 0 : 10;
m_loadTest.Scenarios[scenarioNumber].CurrentLoad = loadWanted;
}
LoadTest m_loadTest;
}
There are several web pages about variations on this topic. Searching for terms such as "Visual Studio custom load patterns". See this page for one example.

Creating an async method which throws an exception after a specified amount of time unless a certain condition is met outside of that function

I am currently working on a Ruby script which is supposed to perform different tasks on a pretty long list of hosts. I am using the net-ssh gem for connectivity with those hosts. The thing is, there seem to exist some conditions under which net-ssh times out without throwing an exception. As of know, the script was only once able to finish a run. Most of the time, the scripts just hangs at some point without ever throwing an exception or doing anything.
I thought about running all tasks that may timeout in different threads, passing them a pointer to some variable they can change when the tasks finished successfully, and then check that variable for a given amount of time. If the task has not finished by then, throw an exception in the main thread that I can catch somewhere.
This is the first time I am writing something in Ruby. To give a clear demonstration of what I want to accomplish, this is what I'd do in C++:
void perform_long_running_task(bool* finished);
void start_task_and_throw_on_timeout(int secs, std::function<void(bool*)> func);
int seconds_to_wait {5};
int seconds_task_takes{6};
int main() {
start_task_and_throw_on_timeout(seconds_to_wait, &perform_long_running_task);
// do other stuff
return 0;
}
void perform_long_running_task(bool* finished){
// Do something that may possible timeout..
std::this_thread::sleep_for(std::chrono::seconds(seconds_task_takes));
// Finished..
*finished = true;
}
void start_task_and_throw_on_timeout(int secs, std::function<void(bool*)> func){
bool finished {false};
std::thread task(func, &finished);
while (secs > 0){
std::this_thread::sleep_for(std::chrono::seconds(1));
secs--;
if (finished){
task.join();
return;
}
}
throw std::exception();
}
Here, when 'seconds_task_takes' is bigger than 'seconds_to_wait', an exception is thrown in the main thread. If the task finishes in time, everything goes on smoothly.
However, I need to write my piece of software in a dynamic scripting language that can run anywhere and needs not to be compiled. I would be super glad for any advice about how I could write something like the code above in Ruby.
Thanks alot in advance :)
edit: in the example ,I added a std::function parameter to start_task_and_throw_timeout so it's reusable for all similar functions
I think module timeout has everything you need to do. It allows you to run the block for a while and raise an exception if it was not fast enough.
Here is a code example:
require "timeout"
def run(name)
puts "Running the job #{name}"
sleep(10)
end
begin
Timeout::timeout(5) { run("hard") }
rescue Timeout::Error
puts "Failed!"
end
You can play with it here: https://repl.it/repls/CraftyUnluckyCore. The documentation for the module lives here: https://ruby-doc.org/stdlib-2.5.1/libdoc/timeout/rdoc/Timeout.html. Notice that you can customize not only the timeout, but also error class and message, so different jobs may have different kinds of errors.

Fetch 1M records in orientdb: why is it 6x slower than bare SQL+MySQL

For some graph algorithm I need to fetch a lot of records from a database to memory (~ 1M records). I want this to be done fast and I want the records to be objects (that is: I want ORM). To crudely benchmark different solutions I created a simple problem of one table with 1M Foo objects like I did here: Why is loading SQLAlchemy objects via the ORM 5-8x slower than rows via a raw MySQLdb cursor? .
One can see that fetching them using bare SQL is extremely fast; also converting the records to objects using a simple for-loop is fast. Both execute in around 2-3 seconds. However using ORM's like SQLAlchemy and Hibernate, this takes 20-30 seconds: a lot slower if you ask me, and this is just a simple example without relations and joins.
SQLAlchemy gives itself the feature "Mature, High Performing Architecture," (http://www.sqlalchemy.org/features.html). Similarly for Hibernate "High Performance" (http://hibernate.org/orm/). In a way both are right, because they allow for very generic object oriented data models to be mapped back and forth to a MySQL database. On the other hand they are awfully wrong, since they are 10x slower than just SQL and native code. Personally I think they could do better benchmarks to show this, that is, a benchmark comparing with native SQL + java or python. But that is not the problem at hand.
Of course, I don't want SQL + native code, as it is hard to maintain. So I was wondering why there does not exist something like an object oriented database, which handles the database->object mapping native. Someone suggested OrientDB, hence I tried it. The API is quite nice: when you have your getters and setters right, the object is insertable and selectable.
But I want more than just API-sweetness, so I tried the 1M example:
import java.io.Serializable;
public class Foo implements Serializable {
public Foo() {}
public Foo(int a, int b, int c) { this.a=a; this.b=b; this.c=c; }
public int a,b,c;
public int getA() { return a; }
public void setA(int a) { this.a=a; }
public int getB() { return b; }
public void setB(int b) { this.b=b; }
public int getC() { return c; }
public void setC(int c) { this.c=c; }
}
import com.orientechnologies.orient.object.db.OObjectDatabaseTx;
public class Main {
public static void insert() throws Exception {
OObjectDatabaseTx db = new OObjectDatabaseTx ("plocal:/opt/orientdb-community-1.7.6/databases/test").open("admin", "admin");
db.getEntityManager().registerEntityClass(Foo.class);
int N=1000000;
long time = System.currentTimeMillis();
for(int i=0; i<N; i++) {
Foo foo = new Foo(i, i*i, i+i*i);
db.save(foo);
}
db.close();
System.out.println(System.currentTimeMillis() - time);
}
public static void fetch() {
OObjectDatabaseTx db = new OObjectDatabaseTx ("plocal:/opt/orientdb-community-1.7.6/databases/test").open("admin", "admin");
db.getEntityManager().registerEntityClass(Foo.class);
long time = System.currentTimeMillis();
for (Foo f : db.browseClass(Foo.class).setFetchPlan("*:-1")) {
if(f.getA() == 345234) System.out.println(f.getB());
}
System.out.println("Fetching all Foo records took: " + (System.currentTimeMillis() - time) + " ms");
db.close();
}
public static void main(String[] args) throws Exception {
//insert();
fetch();
}
}
Fetching 1M Foo's using OrientDB takes approximately 18 seconds. The for-loop with the getA() is to force the object fields to be actually loaded into memory, as I noticed that by default they are fetched lazily. I guess this may also be the reason fetching the Foo's is slow, because it has db-access each iteration instead of db-access once when it fetches everything (including the fields).
I tried to fix that using setFetchPlan("*:-1"), I figured it may also apply on fields, but that did not seem to work.
Question: Is there a way to do this fast, preferably in the 2-3 seconds range? Why does this take 18 seconds, whilst the bare SQL version uses 3 seconds?
Addition: Using a ODatabaseDocumentTX like #frens-jan-rumph suggested only gave ma a speedup of approximately 5, but of approximatelt 2. Adjusting the following code gave me a running time of approximately 9 seconds. This is still 3 times slower than raw sql whilst no conversion to Foo's was executed. Almost all time goes to the for-loop.
public static void fetch() {
ODatabaseDocumentTx db = new ODatabaseDocumentTx ("plocal:/opt/orientdb-community-1.7.6/databases/pits2").open("admin", "admin");
long time = System.currentTimeMillis();
ORecordIteratorClass<ODocument> it = db.browseClass("Foo");
it.setFetchPlan("*:0");
System.out.println("Fetching all Foo records took: " + (System.currentTimeMillis() - time) + " ms");
time = System.currentTimeMillis();
for (ODocument f : it) {
//if((int)f.field("a") == 345234) System.out.println(f.field("b"));
}
System.out.println("Iterating all Foo records took: " + (System.currentTimeMillis() - time) + " ms");
db.close();
}
The answer lies in convenience.
During an interview, when I asked a candidate what they thought of LINQ (C# I know, but pertinent to your question), they quite rightly answered that it was a sacrifice of performance, over convenience.
A hand-written SQL statement (whether or not it calls a stored procedure) is always going to be faster than using an ORM that auto-magically converts the results of the query in to nice, easy-to-use POCOs.
That said, the difference should not be that great as you have experienced. Yes, there is overhead in doing it the auto-magical way, but it shouldn't be that great. I do have experience here, and within C# I have had to use special reflection classes to reduce the time it takes to do this auto-magical mapping.
With large swabs of data, I would expect an initial slow-down from an ORM, but then it would be negligible. 3 seconds to 18 seconds is huge.
If you profile your test, you would discover that around 60 - 80% of the CPU time is taken by execution of the following four methods:
com.orienttechnologies...OObjectEntitySerializer.getField(...)
com.orienttechnologies...OObjectEntityEnhancer.getProxiedInstance(...)
com.orienttechnologies...OObjectMethodFilter.isScalaClass(...)
javaassist...SecurityActions.getDeclaredMethods(...)
So yes, in this setup the bottleneck is in the ORM layer. Using ODatabaseDocumentTx provides a speedup of around 5x. Might just get you where you want to be.
Still a lot of time (close to 50%) is spent in com.orientechnologies...OJNADirectMemory.getInt(...). That's expensive for just reading an integer from a memory location. Don't understand why not just the java nio bytebuffers are used here. Saves a lot of crossing the Java / native border, etc.
Apart from these micro benchmarks and remarkable behaviour in OrientDB I think that there are at least two other things to consider:
Does this test reflect your expected workload?
I.e. you read a straightforward list of records. If so, why use a database? If not, then test on the actual workload, e.g. your searches, graph traversals, etc.
Does this test reflect your expected setup?
E.g. you are reading from a plocal database while reading from any database over tcp/ip might just as well have its bottleneck somewhere else. Also, you are reading from one thread / process; if you expect concurrent use of the database, this probably throws things off considerably (disk seeks, more book keeping overhead, etc.)
P.S. I would recommend warming up code before benchmarking
What you do here is a worst case scenario. As you wrote (or should have wrote) for your database your test is just reading a table and writes it directly to a stream of whatever.
So what you see is the complete overhead of alot of magic. Usually if you do something more complex like joining, selecting, filtering and ordering the overhead of your ORM comes down to a more reasonable share of 5 to 10%.
Another thing you should think about - I guess orient is doing the same - the ORM solution is creating new objects multiplying memory consumption and Java is really bad on memory consumption and the reason why I use custom in memory tables all the time I handle a lot of data / objects.
You know where an object is a row in a table.
Another thing your objects get also inserted into a list / map (at least Hibernate is doing it). It tracks the dirtiness of the objects once you change them. This insertion also takes a lot of time when you rescale it and is a reason why we use paginated lists or maps. copying 1M references is dead slow if the area grows.

Producer / Consumer - Producer using high CPU

I have a consumer as part of the producer consumer pattern:
simplified:
public class MessageFileLogger : ILogger
{
private BlockingCollection<ILogItem> _messageQueue;
private Thread _worker;
private bool _enabled = false;
public MessageFileLogger()
{
_worker = new Thread(LogMessage);
_worker.IsBackground = true;
_worker.Start();
}
private void LogMessage()
{
while (_enabled)
{
if (_messageQueue.Count > 0)
{
itm = _messageQueue.Take();
processItem(itm);
}
else
{
Thread.Sleep(1000);
}
}
}
}
If I remove the
Thread.Sleep(1000);
The CPU usages climbs to something extremely high (13%) as opposed to 0%, with setting the thread to sleep.
Also, if I instantiate multiple instances of the class, the CPU usage climbs in 13% increments, with each instance.
A new LogItem is added the BlockingCollection about every minute or so (maybe every 30 seconds), and writes an applicable message to a file.
Is it possible that the thread is somehow blocking other threads from running, and the system somehow needs to compensate?
Update:
Updated code to better reflect actual code
You gave the thread code to run, so by default it runs that code (the while loop) as fast as it possibly can on a single logical core. Since that's about 13%, I'd imagine your CPU has 4 hyperthreaded cores, resulting in 8 logical cores. Each thread runs it's while loop as fast as it possibly can on it's core, resulting in another 13% usage. Pretty straightforward really.
Side effects of not using sleep are that the whole system runs slower, and uses/produces SIGNIFICANTLY more battery/heat.
Generally, the proper way is to give the _messageQueue another method like
bool BlockingCollection::TryTake(type& item, std::chrono::milliseconds time)
{
DWORD Ret = WaitForSingleObject(event, time.count());
if (Ret)
return false;
item = Take(); //might need to use a shared function instead of calling direct
return true;
}
Then your loop is easy:
private void LogMessage()
{
type item;
while (_enabled)
{
if (_messageQueue.Take(item, std::chrono::seconds(1)))
;//your origional code makes little sense, but this is roughly the same
processItem(itm);
}
}
It also means that if an item is added at any point during the blocking part, it's acted on immediately instead of up to a full second later.

Efficient Independent Synchronized Blocks?

I have a scenario where, at certain points in my program, a thread needs to update several shared data structures. Each data structure can be safely updated in parallel with any other data structure, but each data structure can only be updated by one thread at a time. The simple, naive way I've expressed this in my code is:
synchronized updateStructure1();
synchronized updateStructure2();
// ...
This seems inefficient because if multiple threads are trying to update structure 1, but no thread is trying to update structure 2, they'll all block waiting for the lock that protects structure 1, while the lock for structure 2 sits untaken.
Is there a "standard" way of remedying this? In other words, is there a standard threading primitive that tries to update all structures in a round-robin fashion, blocks only if all locks are taken, and returns when all structures are updated?
This is a somewhat language agnostic question, but in case it helps, the language I'm using is D.
If your language supported lightweight threads or Actors, you could always have the updating thread spawn a new a new thread to change each object, where each thread just locks, modifies, and unlocks each object. Then have your updating thread join on all its child threads before returning. This punts the problem to the runtime's schedule, and it's free to schedule those child threads any way it can for best performance.
You could do this in langauges with heavier threads, but the spawn and join might have too much overhead (though thread pooling might mitigate some of this).
I don't know if there's a standard way to do this. However, I would implement this something like the following:
do
{
if (!updatedA && mutexA.tryLock())
{
scope(exit) mutexA.unlock();
updateA();
updatedA = true;
}
if (!updatedB && mutexB.tryLock())
{
scope(exit) mutexB.unlock();
updateB();
updatedB = true;
}
}
while (!(updatedA && updatedB));
Some clever metaprogramming could probably cut down the repetition, but I leave that as an exercise for you.
Sorry if I'm being naive, but do you not just Synchronize on objects to make the concerns independent?
e.g.
public Object lock1 = new Object; // access to resource 1
public Object lock2 = new Object; // access to resource 2
updateStructure1() {
synchronized( lock1 ) {
...
}
}
updateStructure2() {
synchronized( lock2 ) {
...
}
}
To my knowledge, there is not a standard way to accomplish this, and you'll have to get your hands dirty.
To paraphrase your requirements, you have a set of data structures, and you need to do work on them, but not in any particular order. You only want to block waiting on a data structure if all other objects are blocked. Here's the pseudocode I would base my solution on:
work = unshared list of objects that need updating
while work is not empty:
found = false
for each obj in work:
try locking obj
if successful:
remove obj from work
found = true
obj.update()
unlock obj
if !found:
// Everything is locked, so we have to wait
obj = randomly pick an object from work
remove obj from work
lock obj
obj.update()
unlock obj
An updating thread will only block if it finds that all objects it needs to use are locked. Then it must wait on something, so it just picks one and locks it. Ideally, it would pick the object that will be unlocked earliest, but there's no simple way of telling that.
Also, it's conceivable that an object might become free while the updater is in the try loop and so the updater would skip it. But if the amount of work you're doing is large enough, relative to the cost of iterating through that loop, the false conflict should be rare, and it would only matter in cases of extremely high contention.
I don't know any "standard" way of doing this, sorry. So this below is just a ThreadGroup, abstracted by a Swarm-class, that »hacks» at a job list until all are done, round-robin style, and makes sure that as many threads as possible are used. I don't know how to do this without a job list.
Disclaimer: I'm very new to D, and concurrency programming, so the code is rather amateurish. I saw this more as a fun exercise. (I'm too dealing with some concurrency stuff.) I also understand that this isn't quite what you're looking for. If anyone has any pointers I'd love to hear them!
import core.thread,
core.sync.mutex,
std.c.stdio,
std.stdio;
class Swarm{
ThreadGroup group;
Mutex mutex;
auto numThreads = 1;
void delegate ()[int] jobs;
this(void delegate()[int] aJobs, int aNumThreads){
jobs = aJobs;
numThreads = aNumThreads;
group = new ThreadGroup;
mutex = new Mutex();
}
void runBlocking(){
run();
group.joinAll();
}
void run(){
foreach(c;0..numThreads)
group.create( &swarmJobs );
}
void swarmJobs(){
void delegate () myJob;
do{
myJob = null;
synchronized(mutex){
if(jobs.length > 0)
foreach(i,job;jobs){
myJob = job;
jobs.remove(i);
break;
}
}
if(myJob)
myJob();
}while(myJob)
}
}
class Jobs{
void job1(){
foreach(c;0..1000){
foreach(j;0..2_000_000){}
writef("1");
fflush(core.stdc.stdio.stdout);
}
}
void job2(){
foreach(c;0..1000){
foreach(j;0..1_000_000){}
writef("2");
fflush(core.stdc.stdio.stdout);
}
}
}
void main(){
auto jobs = new Jobs();
void delegate ()[int] jobsList =
[1:&jobs.job1,2:&jobs.job2,3:&jobs.job1,4:&jobs.job2];
int numThreads = 2;
auto swarm = new Swarm(jobsList,numThreads);
swarm.runBlocking();
writefln("end");
}
There's no standard solution but rather a class of standard solutions depending on your needs.
http://en.wikipedia.org/wiki/Scheduling_algorithm

Resources