DeviceSumModuleF32 is broken - aleagpu

let sumModule = (new DeviceSumModuleF32(GPUModuleTarget.Worker(worker))).Create(2e2 |> int)
let t = worker.Malloc([|1.0f;1.0f;1.0f;1.0f;|])
let q = sumModule.Reduce(t.Ptr,4)
The above code crashes on the last line in roughly 66% of runs. I've tried varying the parameters, but it makes no difference. I think DeviceSumModuleF32 might be broken.
let sumModule = (new DeviceReduceModule<float32>(GPUModuleTarget.Worker(worker),<# (+) #>)).Create(2e9 |> int)
let t = worker.Malloc([|1.0f;1.0f;1.0f;1.0f;|])
let q = sumModule.Reduce(t.Ptr,4)
The code above, using DeviceReduceModule, works perfectly fine though.
See this post.
Edit: I should have written that instead of crashing, it goes into an infinite loop. Sorry about that.

I think this might be a bug in disposing the GPU module. Here is a workaround: switch the CUDA context mode to "threaded", and use the "use" keyword to manage the lifetime of the GPU module (a GPU module is the result of compilation, so it should be kept alive as long as possible to avoid re-compiling at runtime).
// workaround to use threaded cuda context mode
Alea.CUDA.Settings.Instance.Worker.DefaultContextType <- "threaded"
// compile GPU code and keep the module live for a long time
use reduceModule = new DeviceReduceModule<float32>(GPUModuleTarget.Worker(worker),<# (+) #>)
// now get a reducer from the reduce module.
// the reducer object includes some temporary memory for the algorithm
use reducer = reduceModule.Create(maxReduceNumber)
reducer.Reduce(....)

Related

How should I initialize an `Arc<[u8; 65536]>` efficiently?

I'm writing an application that creates Arc objects holding large arrays:
use std::sync::Arc;
let buffer: Arc<[u8; 65536]> = Arc::new([0u8; 65536]);
After profiling this code, I've found that a memmove is occurring, making this slow. With other calls to Arc::new, the compiler seems smart enough to initialize the stored data without the memmove.
Believe it or not, the above code is faster than:
use std::sync::Arc;
use std::mem;
let buffer: Arc<[u8; 65536]> = Arc::new(unsafe { mem::uninitialized() });
Which is a bit of a surprise.
Insights welcome, I expect this is a compiler issue.
Yeah, right now you have to lean on compiler optimizations, and apparently that isn't happening in this case. I'm not sure why.
We are also still working on placement new functionality, which will let you explicitly tell the compiler that you want to initialize this on the heap directly. See https://github.com/rust-lang/rfcs/pull/809 (and https://github.com/rust-lang/rfcs/pull/1228, which proposes changes that don't matter for this question). Once this is implemented, this should work:
let buffer: Arc<_> = box [0u8; 65536];

std::copy runtime_error when working with uint16_t's

I'm looking for input as to why this breaks. See the addendum for contextual information, but I don't really think it is relevant.
I have an std::vector<uint16_t> depth_buffer that is initialized to have 640*480 elements. This means that the total space it takes up is 640*480*sizeof(uint16_t) = 614400.
The code that breaks:
void Kinect360::DepthCallback(void* _depth, uint32_t timestamp) {
lock_guard<mutex> depth_data_lock(depth_mutex);
uint16_t* depth = static_cast<uint16_t*>(_depth);
std::copy(depth, depth + depthBufferSize(), depth_buffer.begin());/// the error
new_depth_frame = true;
}
where depthBufferSize() will return 614400 (I've verified this multiple times).
My understanding of std::copy(first, amount, out) is that first specifies the memory address to start copying from, amount is how far in bytes to copy until, and out is the memory address to start copying to.
Of course, it can be done manually with something like
#pragma unroll
for(auto i = 0; i < 640*480; ++i) depth_buffer[i] = depth[i];
instead of the call to std::copy, but I'm really confused as to why std::copy fails here. Any thoughts???
Addendum: the context is that I am writing a derived class that inherits from FreenectDevice to work with a Kinect 360. Officially the error is a Bus Error, but I'm almost certain this is because libfreenect interprets an error in the DepthCallback as a Bus Error. Stepping through with lldb, it's a standard runtime_error being thrown from std::copy. If I manually enter depth + 614400 it will crash, though if I have depth + (640*480) it will chug along. At this stage I am not doing something meaningful with the depth data (rendering the raw depth appropriately with OpenGL is a separate issue xD), so it is hard to tell if everything got copied, or just a portion. That said, I'm almost positive it doesn't grab it all.
Contrasted with the corresponding VideoCallback and the call inside of copy(video, video + videoBufferSize(), video_buffer.begin()), I don't see why the above would crash. If my understanding of std::copy were wrong, this should crash too since videoBufferSize() is going to return 640*480*3*sizeof(uint8_t) = 640*480*3 = 921600. The *3 is from the fact that we have 3 uint8_t's per pixel, RGB (no A). The VideoCallback works swimmingly, as verified with OpenGL (and the fact that it's essentially identical to the samples provided with libfreenect...). FYI none of the samples I have found actually work with the raw depth data directly, all of them colorize the depth and use an std::vector<uint8_t> with RGB channels, which does not suit my needs for this project.
I'm happy to just ignore it and move on in some senses because I can get it to work, but I'm really quite perplexed as to why this breaks. Thanks for any thoughts!
The way std::copy works is that you provide start and end points of your input sequence and the location to begin copying to. The end point that you're providing is off the end of your sequence, because your depthBufferSize function is giving an offset in bytes, rather than the number of elements in your sequence.
If you remove the multiply by sizeof(uint16_t), it will work. At that point, you might also consider calling std::copy_n instead, which takes the number of elements to copy.
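For concreteness, here is a sketch of the corrected callback (it keeps depthBufferSize() as a byte count and converts it to an element count, and assumes the same includes and members as the original class):
void Kinect360::DepthCallback(void* _depth, uint32_t timestamp) {
    lock_guard<mutex> depth_data_lock(depth_mutex);
    uint16_t* depth = static_cast<uint16_t*>(_depth);
    // depthBufferSize() returns bytes; std::copy expects an element count
    const size_t num_elements = depthBufferSize() / sizeof(uint16_t); // 640*480 = 307200
    std::copy(depth, depth + num_elements, depth_buffer.begin());
    // or, equivalently:
    // std::copy_n(depth, num_elements, depth_buffer.begin());
    new_depth_frame = true;
}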
Edit: I just realised that I didn't answer the question directly.
Based on my understanding of std::copy, it shouldn't be throwing exceptions with the input you're giving it. The only thing in that code that could throw a runtime_error is the locking of the mutex.
Considering you have undefined behaviour as a result of running off of the end of your buffer, I'm tempted to say that has something to do with it.

How to check which index in a loop is executing without slowing down the process?

What is the best way to check which index is executing in a loop without slowing the process down too much?
For example, I want to find all long fancy numbers and have a loop like
for( long i = 1; i > 0; i++){
//block
}
and I want to see which i is executing in real time.
Several ways I know of doing this in the block are printing i every time, checking a condition like if (i % 10000 == 0), or adding a listener.
Which one of these ways is the fastest? Or what do you do in similar cases? Is there any way to access the value of i manually?
Most of my recent experience is with Java, so I'd write something like this:
import java.util.concurrent.atomic.AtomicLong;
public class Example {
public static void main(String[] args) {
AtomicLong atomicLong = new AtomicLong(1); // initialize to 1
LoopMonitor lm = new LoopMonitor(atomicLong);
Thread t = new Thread(lm);
t.start(); // start LoopMonitor
while(atomicLong.get() > 0) {
long l = atomicLong.getAndIncrement(); // equivalent to long l = atomicLong++ if atomicLong were a primitive
//block
}
}
private static class LoopMonitor implements Runnable {
private final AtomicLong atomicLong;
public LoopMonitor(AtomicLong atomicLong) {
this.atomicLong = atomicLong;
}
public void run() {
while(true) {
try {
System.out.println(atomicLong.longValue()); // Print l
Thread.sleep(1000); // Sleep for one second
} catch (InterruptedException ex) {}
}
}
}
}
Most AtomicLong implementations can be set in one clock cycle even on 32-bit platforms, which is why I used it here instead of a primitive long (you don't want to inadvertently print a half-set long); look into your compiler / platform details to see if you need something like this, but if you're on a 64-bit platform then you can probably use a primitive long regardless of which language you're using. The modified for loop doesn't take much of an efficiency hit - you've replaced a primitive long with a reference to a long, so all you've added is a pointer dereference.
It won't be easy, but probably the only way to probe the value without affecting the process is to access the loop variable in shared memory with another thread. Threading libraries vary from one system to another, so I can't help much there (on Linux I'd probably use pthreads). The "monitor" thread might do something like probe the value once a minute, sleep()ing in between, and so allowing the first thread to run uninterrupted.
To get near-zero-cost reporting (on multi-CPU machines): store your index in a "global" property (class-wide, for instance), and have a separate thread read and report the index value.
The report could be timer-based (5 times per second or so).
Note: you may also need a boolean stating 'are we in the loop?'.
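For illustration, a minimal sketch of this shared-index / monitor-thread idea in C++ (using std::atomic and std::thread rather than raw pthreads; the bounded loop and the one-second reporting interval are just placeholders):
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

std::atomic<long> current_index{0}; // shared loop index: written by the worker, read by the monitor
std::atomic<bool> in_loop{true};    // "are we in the loop?" flag

int main() {
    std::thread monitor([] {
        while (in_loop.load()) {
            std::printf("i = %ld\n", current_index.load());
            std::this_thread::sleep_for(std::chrono::seconds(1));
        }
    });

    for (long i = 1; i <= 2000000000L; ++i) { // stand-in for the real loop
        current_index.store(i, std::memory_order_relaxed); // cheap publish of the index
        // block
    }

    in_loop.store(false);
    monitor.join();
    return 0;
}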
Volatile and Caches
If you're going to be doing this in, say, C / C++ and use a separate monitor thread as previously suggested, then you'll have to make the global/static loop variable volatile. You don't want the compiler deciding to use a register for the loop variable. Some toolchains make that assumption anyway, but there's no harm in being explicit about it.
And then there's the small issue of caches. A separate monitor thread nowadays will end up on a separate core, and that'll mean that the two separate cache subsystems will have to agree on what the value is. That will unavoidably have a small impact on the runtime of the loop.
Real real time constraint?
So that begs the question of just how real time is your loop anyway? I doubt that your timing constraint is such that you're depending on it running within a specific number of CPU clock cycles. Two reasons, a) no modern OS will ever come close to guaranteeing that, you'd have to be running on the bare metal, b) most CPUs these days vary their own clock rate behind your back, so you can't count on a specific number of clock cycles corresponding to a specific real time interval.
Feature rich solution
So assuming that your real time requirement is not that constrained, you may wish to do a more capable monitor thread. Have a shared structure protected by a semaphore which your loop occasionally updates, and your monitor thread periodically inspects and reports progress. For best performance the monitor thread would take the semaphore, copy the structure, release the semaphore and then inspect/print the structure, minimising the semaphore locked time.
The only advantage of this approach over that suggested in previous answers is that you could report more than just the loop variable's value. There may be more information from your loop block that you'd like to report too.
Mutex semaphores in, say, C on Linux are pretty fast these days. Unless your loop block is very lightweight the runtime overhead of a single mutex is not likely to be significant, especially if you're updating the shared structure every 1000 loop iterations. A decent OS will put your threads on separate cores, but for the sake of good form you'd make the monitor thread's priority higher than the thread running the loop. This would ensure that the monitoring does actually happen if the two threads do end up on the same core.
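A sketch of that feature-rich variant, again in C++ for concreteness (the Progress struct, the 1000-iteration update interval, and the one-second reporting period are illustrative choices, not something the answer prescribes):
#include <chrono>
#include <cstdio>
#include <mutex>
#include <thread>

struct Progress {
    long index = 0;
    bool in_loop = true;
    // add any other per-iteration stats you want to report
};

Progress shared_progress;   // shared between the loop and the monitor
std::mutex progress_mutex;  // protects shared_progress

int main() {
    std::thread monitor([] {
        for (;;) {
            Progress snapshot;
            {
                // hold the lock only long enough to copy the structure
                std::lock_guard<std::mutex> lock(progress_mutex);
                snapshot = shared_progress;
            }
            std::printf("i = %ld\n", snapshot.index);
            if (!snapshot.in_loop) break;
            std::this_thread::sleep_for(std::chrono::seconds(1));
        }
    });

    for (long i = 1; i <= 2000000000L; ++i) { // stand-in for the real loop
        // block
        if (i % 1000 == 0) { // update the shared structure only occasionally
            std::lock_guard<std::mutex> lock(progress_mutex);
            shared_progress.index = i;
        }
    }

    {
        std::lock_guard<std::mutex> lock(progress_mutex);
        shared_progress.in_loop = false;
    }
    monitor.join();
    return 0;
}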

Is this a proper thread-safe Random wrapper?

I am fairly inexperienced with threading and concurrency; to remedy that, I am currently working for fun on implementing a random-search algorithm in F#. I wrote a wrapper around the System.Random class, following ideas from existing C# examples - but as I am not sure how I would even begin to unit test this for faulty behavior, I'd like to hear what more experienced minds have to say, and if there are obvious flaws or improvements with my code, either due to F# syntax or threading misunderstanding:
open System
open System.Threading
type Probability() =
    static let seedGenerator = new Random()
    let localGenerator =
        new ThreadLocal<Random>(
            fun _ ->
                lock seedGenerator (
                    fun _ ->
                        let seed = seedGenerator.Next()
                        new Random(seed)))
    member this.Draw() =
        localGenerator.Value.NextDouble()
My understanding of what this does: ThreadLocal ensures that for an instance, each thread receives its own instance of a Random, with its own random seed provided by a common, static Random. That way, even if multiple instances of the class are created close in time, they will receive their own seed, avoiding the problem of "duplicate" random sequences. The lock enforces that no two threads will get the same seed.
Does this look correct? Are there obvious issues?
I think your approach is pretty reasonable - using ThreadLocal gives you safe access to the Random, and using a master random number generator to provide seeds means that you'll get random values even if you access it from multiple threads at around the same time. It may not be random in the cryptographic sense, but should be fine for most other applications.
As for testing, this is quite tricky. If Random breaks, it will return 0 all the time, but that's just empirical experience, and it is hard to say for how long you need to keep accessing it unsafely. The best thing I can suggest is to implement some simple randomness tests (some simple ones are on Wikipedia) and access your type from multiple threads in a loop - though this is still quite a bad test, as it may not fail every time.
As an aside, you don't need to use a type to encapsulate this behaviour. It can be written as a function too:
open System
open System.Threading
module Probability =
    let Draw =
        // Create master seed generator and thread local value
        let seedGenerator = new Random()
        let localGenerator = new ThreadLocal<Random>(fun _ ->
            lock seedGenerator (fun _ ->
                let seed = seedGenerator.Next()
                new Random(seed)))
        // Return function that uses thread local random generator
        fun () ->
            localGenerator.Value.NextDouble()
This feels wrong-headed. Why not just use a singleton (only ever create one Random instance, and lock it)?
If real randomness is a concern, then maybe see RNGCryptoServiceProvider, which is threadsafe.
Unless there is a performance bottleneck, I think something like
let rand = new Random()
let rnext() =
    lock rand (
        fun () ->
            rand.Next())
will be easier to understand, but I think your method should be fine.
If you really want to go with the OO approach, then your code may be fine (I won't say 'it is' fine as I am not too smart to understand OO :) ). But in case you want to go the functional way it would be as simple as something like:
type Probability = { Draw : unit -> int }
let probabilityGenerator (n:int) =
    let rnd = new Random()
    Seq.init n (fun _ -> new Random(rnd.Next()))
    |> Seq.map (fun r -> { Draw = fun () -> r.Next() })
    |> Seq.toList
Here you can use the probabilityGenerator function to generate as many "Probability" objects as you need and then distribute them to various threads, which can work on them in parallel.
The important thing here is that we are not introducing locks etc. in the core type, i.e. Probability; it becomes the responsibility of the consumer to decide how to distribute it across threads.

D Dynamic Arrays - RAII

I admit I have no deep understanding of D at this point, my knowledge relies purely on what documentation I have read and the few examples I have tried.
In C++ you could rely on the RAII idiom to call the destructor of objects on exiting their local scope.
Can you in D?
I understand D is a garbage collected language, and that it also supports RAII.
Why does the following code not cleanup the memory as it leaves a scope then?
import std.stdio;
void main() {
{
const int len = 1000 * 1000 * 256; // ~1GiB
int[] arr;
arr.length = len;
arr[] = 99;
}
while (true) {}
}
The infinite loop is there to keep the program open, so that residual memory allocations are easily visible.
I compared this with an equivalent program in C++.
The C++ version immediately cleaned up the memory once the allocation left scope (the refresh rate makes it appear as if less memory was allocated), whereas D kept it even though it had left scope.
So, when does the GC clean up?
scope declarations are going away in D2, so I'm not terribly certain about the semantics, but what I'd imagine is happening is that scope T[] a; only allocates the array struct on the stack (which, needless to say, already happens regardless of scope). Since they are going away, don't use scope (using scope(exit) and friends is different -- keep using those).
Dynamic arrays always use the GC to allocate their memory -- there's no getting around that. If you want something more deterministic, using std.container.Array would be the simplest way, as I think you could pretty much drop it in where your scope vector3b array is:
Array!vector3b array;
Just don't bother setting the length to zero -- the memory will be freed once it goes out of scope (Array uses malloc/free from libc under the hood).
No, you cannot assume that the garbage collector will collect your object at any point in time.
There is, however, a delete keyword (as well as a scope keyword) that can delete an object deterministically.
scope is used like:
{
scope auto obj = new int[5];
//....
} //obj cleaned up here
and delete is used like in C++ (there's no [] notation for delete).
There are some gotchas, though:
It doesn't always work properly (I hear it doesn't work well with arrays)
The developers of D (e.g. Andrei) are intending to remove them in later versions, because it can obviously mess up things if used incorrectly. (I personally hate this, given that it's so easy to screw things up anyway, but they're sticking with removing it, and I don't think people can convince them otherwise although I'd love it if that was the case.)
In its place, there is already a clear method that you can use, like arr.clear(); however, I'm not quite sure what exactly it does yet myself, but you could look at the source code in object.d in the D runtime if you're interested.
As to your amazement: I'm glad you're amazed, but it shouldn't really be surprising, considering that they're both native code. :-)

Resources