Split a loop into smaller parts - algorithm

I have a function inside a loop that consumes a lot of resources and usually does not complete unless the server is under low load.
How can I split the loop into smaller parts? When I decrease the loop bound, the function executes fine.
As an example, this works:
x = 10
for i = 0; i <= x; i++
{
myfunction(i)
}
However, when I increase x to 100, the memory-hogging function stops working.
How can one split 100 into chunks of 10 (for example) and run the loop 10 times?
I should be able to use any value, not only 100 or multiples of 10.
Thank you.

Is your function asynchronous, so that you end up with too many instances running at once? Is it opening resources and not closing them? Perhaps you could put a delay in after every 10 iterations:
x = 1000
for i = 0; i <= x; i++
{
myfunction(i);
if(i%10==0)
{
Thread.Sleep(1000);
}
}
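Alternatively, if the goal is literally to run the work in chunks of 10 with a pause between chunks, a minimal sketch (C#-style, assuming MyFunction stands for the resource-heavy call) that works for any x, not just multiples of 10, could look like this:
int x = 100;
int chunkSize = 10;
for (int start = 0; start <= x; start += chunkSize)
{
    int end = Math.Min(start + chunkSize - 1, x);  // last chunk may be smaller
    for (int i = start; i <= end; i++)
    {
        MyFunction(i);       // the resource-heavy call
    }
    Thread.Sleep(1000);      // give the server time to recover between chunks
}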

If your tasks are asynchronous you can use the worker thread pool technique.
For example, you can create a thread pool with 10 threads.
First you assign 10 tasks to them.
Whenever a task finishes, you assign one of the remaining tasks to the thread that became free.
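A minimal sketch of that idea in C# (hypothetical; MyFunction again stands for the heavy call) uses a semaphore so that at most 10 instances are in flight at any time:
var slots = new SemaphoreSlim(10);        // at most 10 concurrent instances
var tasks = new List<Task>();
for (int i = 0; i <= x; i++)
{
    int index = i;                        // capture the loop variable
    slots.Wait();                         // block until a slot frees up
    tasks.Add(Task.Run(() =>
    {
        try { MyFunction(index); }
        finally { slots.Release(); }      // free the slot when done
    }));
}
Task.WaitAll(tasks.ToArray());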

Related

How to simulate limited RSU capacity in veins?

I have to simulate a scenario with an RSU that has limited processing capacity; it can only process a limited number of messages in a time unit (say 1 second).
I tried to set a counter in the RSU application. The counter is incremented each time the RSU receives a message and decremented after processing it. Here is what I have done:
void RSUApp::onBSM(BasicSafetyMessage* bsm)
{
if(msgCount >= capacity)
{
//drop msg
this->getParentModule()->bubble("capacity limit");
return;
}
msgCount++;
//process message here
msgCount--;
}
It seems useless. I tested it with a capacity limit of 1 and 2 vehicles sending messages at the same time; the RSU processes both, although it should process one and drop the other.
Can anyone help me with this?
At the beginning of the onBSM method the counter is incremented, your logic gets executed, and finally the counter gets decremented. All of those steps happen at once, i.e. in a single step of the simulation.
That is the reason why you don't see an effect.
What you probably want is a certain amount of messages to be processed in a certain time interval (e.g. 500 ms). It could look somehow like this (untested):
if (simTime() <= intervalEnd && msgCount >= capacity)
{
this->getParentModule()->bubble("capacity limit");
return;
} else if (simTime() > intervalEnd) {
intervalEnd = simTime() + YOURINTERVAL;
msgCount = 0;
}
......
The variable YOURINTERVAL would be the amount of time you would like to consider as the interval for your capacity.
You can use a self-message with scheduleAt(simTime() + delay, yourmessage);
the delay will simulate the required processing time.
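A minimal sketch of that approach (untested; busy, processingDelay, and serviceDone are assumed member variables / an assumed self-message, not existing Veins fields):
void RSUApp::onBSM(BasicSafetyMessage* bsm)
{
    if (busy) {
        // still processing the previous message: drop this one
        this->getParentModule()->bubble("capacity limit");
        return;
    }
    busy = true;
    // ... process the message here ...
    scheduleAt(simTime() + processingDelay, serviceDone); // self-message
}

void RSUApp::handleSelfMsg(cMessage* msg)
{
    if (msg == serviceDone) {
        busy = false; // processing time elapsed, ready for the next message
        return;
    }
    // hand anything else to the base class of your application layer
}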

Operating in parallel on a large constant datastructure in Julia

I have a large vector of vectors of strings:
There are around 50,000 vectors of strings,
each of which contains 2-15 strings of length 1-20 characters.
MyScoringOperation is a function which operates on a vector of strings (the datum) and returns an array of 10100 scores (as Float64s). It takes about 0.01 seconds to run MyScoringOperation (depending on the length of the datum)
function MyScoringOperation(state::State, datum::Vector{String})
...
score::Vector{Float64} #length of score = 10100
I have what amounts to a nested loop.
The outer loop typically runs for 500 iterations:
data::Vector{Vector{String}} = loaddata()
for ii in 1:500
score_total = zeros(10100)
for datum in data
score_total+=MyScoringOperation(datum)
end
end
On one computer, on a small test case of 3000 (rather than 50,000) this takes 100-300 seconds per outer loop.
I have 3 powerful servers with Julia v0.3.9 installed (and can easily get 3 more, and then hundreds more at the next scale).
I have basic experience with @parallel; however, it seems to spend a lot of time copying the constant (it more or less hangs on the smaller test case).
That looks like:
data::Vector{Vector{String}} = loaddata()
state = init_state()
for ii in 1:500
score_total = @parallel(+) for datum in data
MyScoringOperation(state, datum)
end
state = update(state, score_total)
end
My understanding of the way this implementation works with @parallel is that, for each ii, it:
1. partitions data into a chunk for each worker
2. sends that chunk to each worker
3. has each worker process its chunk
4. lets the main procedure sum the results as they arrive.
I would like to remove step 2, so that instead of sending a chunk of data to each worker, I just send a range of indexes to each worker, and each worker looks the data up from its own copy. Or even better, give each worker only its own chunk and have it reuse that chunk each iteration (saving a lot of RAM).
Profiling backs up my belief about the functioning of @parallel.
For a similarly scoped problem (with even smaller data), the non-parallel version runs in 0.09 seconds, while the parallel version takes 185 seconds. The profiler shows almost 100% of that time is spent interacting with network IO.
This should get you started:
function get_chunks(data::Vector, nchunks::Int)
base_len, remainder = divrem(length(data),nchunks)
chunk_len = fill(base_len,nchunks)
chunk_len[1:remainder]+=1 #remainder will always be less than nchunks
function _it()
for ii in 1:nchunks
chunk_start = sum(chunk_len[1:ii-1])+1
chunk_end = chunk_start + chunk_len[ii] -1
chunk = data[chunk_start: chunk_end]
produce(chunk)
end
end
Task(_it)
end
function r_chunk_data(data::Vector)
all_chunks = get_chunks(data, nworkers()) |> collect;
remote_chunks = [put!(RemoteRef(pid)::RemoteRef, all_chunks[ii]) for (ii,pid) in enumerate(workers())]
#Have to add the type annotation as otherwise it thinks that RemoteRef(pid) might return a RemoteValue
end
function fetch_reduce(red_acc::Function, rem_results::Vector{RemoteRef})
total = nothing
#TODO: consider strongly wrapping total in a lock, when in 0.4, so that it is guaranteed safe
@sync for rr in rem_results
function gather(rr)
res=fetch(rr)
if total===nothing
total=res
else
total=red_acc(total,res)
end
end
@async gather(rr)
end
total
end
function prechunked_mapreduce(r_chunks::Vector{RemoteRef}, map_fun::Function, red_acc::Function)
rem_results = map(r_chunks) do r_chunk
function do_mapred()
@assert r_chunk.where==myid()
@pipe r_chunk |> fetch |> map(map_fun,_) |> reduce(red_acc, _) # @pipe is from Lazy.jl (or a similar pipe macro)
end
remotecall(r_chunk.where, do_mapred)
end
@pipe rem_results |> convert(Vector{RemoteRef},_) |> fetch_reduce(red_acc, _)
end
r_chunk_data breaks the data into chunks (defined by the get_chunks method) and sends each chunk to a different worker, where they are stored in RemoteRefs. The RemoteRefs are references to memory on your other processes (and potentially other computers).
prechunked_mapreduce does a variation on a kind of map-reduce: each worker first runs map_fun on each of its chunk's elements, then reduces over all the elements in its chunk using red_acc (a reduction accumulator function). Finally each worker returns its result, and the results are combined by reducing them all together with red_acc, this time via fetch_reduce so that the first ones to complete are added first.
fetch_reduce is a nonblocking fetch-and-reduce operation. I believe it has no race conditions, though this may be because of an implementation detail in @async and @sync. When Julia 0.4 comes out, it is easy enough to put a lock in to make it obviously free of race conditions.
This code isn't really battle hardened.
You also might want to look at making the chunk size tunable, so that you can send more data to faster workers (if some have better network or faster CPUs).
You need to re-express your code as a map-reduce problem, which doesn't look too hard.
Testing that with:
data = [float([eye(100),eye(100)])[:] for _ in 1:3000] #480Mb
remote_chunks = r_chunk_data(data)
@time prechunked_mapreduce(remote_chunks, mean, (+))
Took ~0.03 seconds, when distributed across 8 workers (none of them on the same machine as the launcher)
vs running just locally:
@time reduce(+,map(mean,data))
took ~0.06 seconds.

How can a value be accumulated when run in parallel/concurrent processes?

I'm running some Ruby scripts concurrently using Grosser/Parallel.
During each concurrent test I want to add up the number of times a particular thing has happened, then display that number.
Let's say:
def main
$this_happened = 0
do_this_in_parallel
puts $this_happened
end
def do_this_in_parallel
Parallel.each(...) {
$this_happened += 1
}
end
The final value after do_this_in_parallel has finished will always be 0.
I'd like to know why this happens.
How can I get the desired result, which would be $this_happened > 0?
Thanks.
This doesn't work because separate processes have separate memory spaces: setting variables in one process has no effect on what happens in the other processes.
However you can return a result from your block (because under the hood Parallel sets up pipes so that the processes can be fed input and return results). For example you could do this:
counts = Parallel.map(...) do
#the return value of the block should
#be the number of times the event occurred
end
Then just sum the counts to get your total count (e.g. counts.reduce(:+)). You might also want to read up on map-reduce for more information about this way of parallelising work.
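As a concrete sketch (hypothetical names: items and event_happened? stand in for whatever you iterate over and test), it could look like this:
require 'parallel'

items = (1..100).to_a

# Each block returns 1 or 0; Parallel ships the return values back to the
# parent process over pipes, so they can be summed afterwards.
counts = Parallel.map(items) do |item|
  event_happened?(item) ? 1 : 0   # event_happened? is a placeholder for your own check
end

total = counts.reduce(0, :+)
puts total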
I have never used parallel but the documentation seems to suggest that something like this might work.
Parallel.each(..., :finish => lambda {|*_| $this_happened += 1}) { do_work }

Use of pthread increases execution time, suggestions for improvements

I had a piece of code, which looked like this,
for(i=0;i<NumberOfSteps;i++)
{
for(k=0;k<NumOfNodes;k++)
{
mark[crawler[k]]++;
r = rand() % node_info[crawler[k]].num_of_nodes;
crawler[k] = (int)DataBlock[node_info[crawler[k]].index+r][0];
}
}
I changed it so that the load can be split among multiple threads. Now it looks like this,
for(i=0;i<NumberOfSteps;i++)
{
for(k=0;k<NumOfNodes;k++)
{
pthread_mutex_lock( &mutex1 );
mark[crawler[k]]++;
pthread_mutex_unlock( &mutex1 );
pthread_mutex_lock( &mutex1 );
r = rand() % node_info[crawler[k]].num_of_nodes;
pthread_mutex_unlock( &mutex1 );
pthread_mutex_lock( &mutex1 );
crawler[k] = (int)DataBlock[node_info[crawler[k]].index+r][0];
pthread_mutex_unlock( &mutex1 );
}
}
I need the mutexes to protect shared variables. It turns out that my parallel code is slower. But why? Is it because of the mutexes?
Could this possibly be something to do with the cacheline size ?
You are not parallelizing anything but the loop heads. Everything between lock and unlock is forced to be executed sequentially. And since lock/unlock are (potentially) expensive operations, the code is getting slower.
To fix this, you should at least separate expensive computations (without mutex protection) from access to shared data areas (with mutexes). Then try to move the mutexes out of the inner loop.
You could use atomic increment instructions (depending on the platform) instead of a plain '++'; they are generally cheaper than mutexes. But beware of doing this often on data within a single cache line from different threads in parallel (see 'false sharing').
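For example, with C11 atomics the shared counter increment can be done without a mutex; a minimal sketch (assuming mark can be declared as an array of atomics, with N a hypothetical size):
#include <stdatomic.h>

#define N 1024                 /* hypothetical number of nodes */
_Atomic int mark[N];           /* atomic counters, safe to increment concurrently */

void count_visit(int node)     /* hypothetical helper called from each thread */
{
    /* relaxed ordering is enough for a simple event counter */
    atomic_fetch_add_explicit(&mark[node], 1, memory_order_relaxed);
}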
AFAICS, you could rewrite the algorithm as indicated below without needing mutexes or atomic increments at all. getFirstK() is NumOfNodes/NumOfThreads*t if NumOfNodes is an integral multiple of NumOfThreads.
for(t=0;t<NumberOfThreads;t++)
{
kbegin = getFirstK(NumOfNodes, NumOfThreads, t);
kend = getFirstK(NumOfNodes, NumOfThreads, t+1);
// start the following in a separate thread with kbegin and kend
// copied to thread local vars kbegin_ and kend_
int k, i, r;
unsigned state = kend_; // really bad seed
for(k=kbegin_;k<kend_;k++)
{
for(i=0;i<NumberOfSteps;i++)
{
mark[crawler[k]]++;
r = rand_r(&state) % node_info[crawler[k]].num_of_nodes;
crawler[k] = (int)DataBlock[node_info[crawler[k]].index+r][0];
}
}
}
// wait for threads/jobs to complete
This way of generating random numbers may lead to bad random distributions; see this question for details.

How many tasks in parallel

I built a sample program to check the performance of tasks in parallel, with respect to the number of tasks running in parallel.
A few assumptions:
The operation on one thread is independent of the operations on other threads, so no synchronization mechanisms between threads are needed.
The idea is to check whether it is more efficient to:
1. Spawn as many tasks as possible, or
2. Restrict the number of tasks running in parallel, and wait for some tasks to complete before spawning the remaining ones.
Following is the program:
static void Main(string[] args)
{
System.IO.StreamWriter writer = new System.IO.StreamWriter("C:\\TimeLogV2.csv");
SemaphoreSlim availableSlots;
for (int slots = 10; slots <= 20000; slots += 10)
{
availableSlots = new SemaphoreSlim(slots, slots);
int maxTasks;
CountdownEvent countDownEvent;
Stopwatch watch = new Stopwatch();
watch.Start();
maxTasks = 20000;
countDownEvent = new CountdownEvent(maxTasks);
for (int i = 0; i < maxTasks; i++)
{
Console.WriteLine(i);
Task task = new Task(() => Thread.Sleep(50));
task.ContinueWith((t) =>
{
availableSlots.Release();
countDownEvent.Signal();
}
);
availableSlots.Wait();
task.Start();
}
countDownEvent.Wait();
watch.Stop();
writer.WriteLine("{0},{1}", slots, watch.ElapsedMilliseconds);
Console.WriteLine("{0}:{1}", slots, watch.ElapsedMilliseconds);
}
writer.Flush();
writer.Close();
}
Here are the results:
The Y-axis is time-taken in milliseconds, X-axis is the number of semaphore slots (refer to above program)
Essentially the trend is: the more parallel tasks, the better. Now my question is: under what conditions is a larger number of parallel tasks less optimal (i.e. takes more time)?
One condition I suppose is that the tasks are interdependent, and may have to wait for certain resources to become available.
Have you in any scenario limited the number of parallel tasks?
The TPL will control how many threads are running at once - basically you're just queuing up tasks to be run on those threads. You're not really running all those tasks in parallel.
The TPL will use work-stealing queues to make it all as efficient as possible. If you have all the information about what tasks you need to run, you might as well queue them all to start with, rather than trying to micro-manage it yourself. Of course, this will take memory - so that might be a concern, if you have a huge number of tasks.
I wouldn't try to artificially break your logical tasks into little bits just to get more tasks, however. You shouldn't view "more tasks == better" as a general rule.
(I note that you're including time taken to write a lot of lines to the console in your measurements, by the way. I'd remove those Console.WriteLine calls and try again - they may well make a big difference.)
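For illustration, a minimal sketch of "just queue them all up front" (a hypothetical variant of the benchmark without the semaphore, countdown event, or console output; Task.Run requires .NET 4.5):
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    static void Main()
    {
        const int maxTasks = 20000;
        var watch = Stopwatch.StartNew();

        // Queue everything up front and let the TPL's work-stealing scheduler
        // decide how many tasks actually run at once.
        Task[] tasks = Enumerable.Range(0, maxTasks)
                                 .Select(_ => Task.Run(() => Thread.Sleep(50)))
                                 .ToArray();
        Task.WaitAll(tasks);

        watch.Stop();
        Console.WriteLine("Total: {0} ms", watch.ElapsedMilliseconds);
    }
}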
