I'm just starting to use Julia's CUDArt package to manage GPU computing. I am wondering how to ensure that when I pull data from the GPU (e.g. using to_host()), I don't do so before all of the necessary computations have been performed on it.
Through some experimentation, it seems that to_host(CudaArray) will lag while the particular CudaArray is being updated. So, perhaps just using this is enough to ensure safety? But it seems a bit chancy.
Right now, I am using the launch() function to run my kernels, as depicted in the package documentation.
The CUDArt documentation gives an example using Julia's @sync macro, which seems like it could be lovely. But for the purposes of @sync, my "work" is done and I am ready to move on as soon as the kernel gets launched with launch(), not once it finishes. As far as I understand the operation of launch(), there isn't a way to change this behavior (e.g. to make it wait to receive the output of the function it "launches").
How can I accomplish such synchronization?
Ok, so there isn't a ton of documentation on the CUDArt package, but I looked at the source code and it seems straightforward how to do this. In particular, there is a device_synchronize() function that will block until all of the work on the currently active device has finished. Thus, the following seems to work:
using CUDArt

md = CuModule("/path/to/module.ptx", false)
MyFunc = CuFunction(md, "MyFunc")

GridDim = 2*2496
BlockDim = 64
launch(MyFunc, GridDim, BlockDim, (arg1, arg2, ...))

# Block until all work on the currently active device has finished
device_synchronize()

# Now it is safe to copy the result back to the host
res = to_host(arg2)
I'd love to hear from anyone with more expertise, though, if there is anything more to be aware of here.
I think the more canonical way is to make a stream for each device:
streams = [(device(dev); Stream()) for dev in devlist]
and then inside the @async block, after you tell it to do the computations, you use the wait(stream) function to tell it to wait for that stream to finish its computations. See the Streams example in the README; a sketch of the whole pattern follows.
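To make that concrete, here is a minimal sketch of the pattern, reusing MyFunc, GridDim, BlockDim and the placeholder kernel arguments from the question. It assumes devlist came from a devices(...) do-block as in the README, and that launch accepts a stream keyword argument:

streams = [(device(dev); Stream()) for dev in devlist]

@sync begin
    for (i, dev) in enumerate(devlist)
        @async begin
            device(dev)   # make this device active within this task
            # assumption: launch takes a stream keyword; arg names are the question's placeholders
            launch(MyFunc, GridDim, BlockDim, (arg1, arg2), stream=streams[i])
            wait(streams[i])      # yields until this stream's work has finished
            res = to_host(arg2)   # safe to copy back to the host now
        end
    end
end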
This question deals with the first-load performance of Julia.
I am running a Julia program from the command line. The program is intended to be an API where the user of the API doesn't have to initialize internal objects.
module Internal

type X
    c::T1
    d::T2
    # ... more fields
end

function do_something(a::X, arg::Any)
    # some function
end

export X, do_something

end
API.jl
using Internal

const i_of_X = X()

function wrapper_do_something(args::Any)
    do_something(i_of_X, args)
end
Now, this API.jl is exposed to third-party users so that they don't have to bother with instantiating the internal objects. However, API.jl is not a module and hence cannot be precompiled. As there are many functions in API.jl, the first load takes a "very" long time.
Is there a way to improve the performance? I tried wrapping API.jl in a module too, but I don't know if wrapping const-initialized variables in a module is the way to go. I also get a segmentation fault when doing so (some of the consts are database connections and database collections, along with other complex objects).
I am using v0.5 on OS X.
[EDIT]
I did wrap API.jl in a module but there is no performance improvement.
I dug deeper, and a big performance hit comes from the first call to the linear regression function (GLM-based OLS, lm(y ~ X, df)). The df has only 2 columns and 3 rows, so it isn't a runtime issue but compilation slowness.
The other big hit comes from calling a highly overloaded function. The overloaded function fetches data from the database and can accept a variety of input formats.
Is there a way to speed these up? Is there a way to fully precompile a Julia program?
For a little more background: the API-based program is called once via the command line, and any persistent first-compilation advantages are lost as soon as the Julia process exits.
$ julia run_api_based_main_func.jl
One hacky way to reuse the compilation benefits would be to somehow copy/paste the code into an already active Julia process. Is this doable/recommended? (I am desperate to make it fast. Waiting 15-20s for a 2s analysis doesn't seem right.)
It is OK to wrap const values in a module. They can be exported as needed.
As Fengyang said, wrapping independent components of a larger design in modules is helpful and will help in this situation. When there is a great deal going on inside a module, the precompile time that accompanies each initial function call can add up. There is a way to avoid that -- precompile the contents before using the module:
__precompile__(true)
module ModuleName
# ...
end # module ModuleName
Please Note (from the online help):
__precompile__() should not be used in a module unless all of its dependencies are also using __precompile__(). Failure to do so can
result in a runtime error when loading the module.
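Regarding the segmentation fault with const database connections: external resources (connections, handles, pointers) cannot be serialized into the precompile cache; the documented pattern is to create them in the module's __init__() function, which runs each time the module is loaded. A minimal sketch of API.jl along those lines, with X exported from Internal as above:

__precompile__(true)
module API

using Internal

# Placeholder filled in at load time; live resources must not be baked
# into the precompiled image.
const i_of_X = Ref{X}()

function __init__()
    i_of_X[] = X()   # construct the connection-holding object on each load
end

wrapper_do_something(args) = do_something(i_of_X[], args)

end # module API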
Is it possible to write parallel code in Elm? Elm is pure functional, so no locking is needed. Of course, I can use the JavaScript FFI, spawn workers there and do it on my own. But I want a more user-friendly way of doing this.
Short answer
No, not currently. But the next release (0.15) will have new ways to handle effects inside Elm, so you will need ports + JavaScript code less. So there may well be a way to spawn workers inside Elm in the next version.
More background
If you're feeling adventurous, try reading the published paper on Elm (or the longer original thesis), which shows that the original flavour of FRP that Elm uses is well suited for fine-grained concurrency. There is also an async construct which can potentially make part of the program run separately in a more coarse-grained manner. That might be supported with OS-level threads (like JS Web Workers) and parallelism.
There have been earlier experiments with Web Workers. There is certainly an interest in concurrency within the community, but JavaScript doesn't offer (m)any great options for concurrency.
For reading tips on the paper, here's a post of mine from the elm-discuss mailing list:
If you want to know more about signals and opt-in async, I suggest you try Evan's PLDI paper on Elm. Read from the introduction (1) up to building GUIs (4). You can skip the type system (3.2) and functional evaluation (3.3.1); that may save you some time. Most of what is in and after building GUIs (4) is probably stuff you know already. Figure 8 is probably the best overview of what the async keyword does (note that the async keyword is not implemented in the current Elm compiler).
I just noticed today that this method, gl.flush(), is available.
I'm not able to find detailed documentation on it.
What exactly does this do?
Is this required?
gl.flush in WebGL does have its uses, but they are driver- and browser-specific.
For example, because Chrome's GPU architecture is multi-process, you can do this:
var loadShader = function(gl, shaderSource, shaderType) {
  var shader = gl.createShader(shaderType);
  gl.shaderSource(shader, shaderSource);
  gl.compileShader(shader);
  return shader;
};
var vs = loadShader(gl, someVertexShaderSource, gl.VERTEX_SHADER);
var fs = loadShader(gl, someFragmentShaderSource, gl.FRAGMENT_SHADER);

var p = gl.createProgram();
gl.attachShader(p, vs);
gl.attachShader(p, fs);
gl.linkProgram(p);
At this point all of the commands might be sitting in the command
queue with nothing executing them yet. So, issue a flush
gl.flush();
Now, because we know that compiling and linking programs is slow (depending on how large and complex they are), we can wait a while before trying to use them and do other stuff.
setTimeout(continueLater, 1000); // continue 1 second later
Now do other things like set up the page or UI or something.
One second later, continueLater will get called. It's likely our shaders have finished compiling and linking.
function continueLater() {
  // check results, get locations, etc.
  if (!gl.getShaderParameter(vs, gl.COMPILE_STATUS) ||
      !gl.getShaderParameter(fs, gl.COMPILE_STATUS) ||
      !gl.getProgramParameter(p, gl.LINK_STATUS)) {
    alert("shaders didn't compile or program didn't link");
    // ...etc...
  }
  var someLoc = gl.getUniformLocation(p, "u_someUniform");
  // ...etc...
}
I believe Google Maps uses this technique as they have to compile many very complex shaders and they'd like the page to stay responsive. If they called gl.compileShader or gl.linkProgram and immediately called one of the query functions like gl.getShaderParameter or gl.getProgramParameter or gl.getUniformLocation the program would freeze while the shader is first validated and then sent to the driver to be compiled. By not doing the querying immediately but waiting a moment they can avoid that pause in the UX.
Unfortunately this only works for Chrome AFAIK because other browsers are not multi-process and I believe all drivers compile/link synchronously.
There may be other reasons to call gl.flush, but again it's very driver/OS/browser-specific. As an example, let's say you were going to draw 1000 objects and doing so took 5000 WebGL calls. It would likely require more than that, but just to have a number let's pick 5000: 4 calls to gl.uniformXXX and 1 call to gl.drawXXX per object.
It's possible all 5000 of those calls fit in the browser's (Chrome's) or driver's command buffer. Without a flush they won't start executing until the browser issues a gl.flush for you (which it does so it can composite your results on the screen). That means the GPU might be sitting idle while you issue 1000, then 2000, then 3000, etc. commands, since they're just sitting in a buffer. gl.flush tells the system "Hey, those commands I added, please make sure to start executing them". So you might decide to call gl.flush after each 1000 commands.
The problem though is that gl.flush is not free; otherwise you'd call it after every command to make sure it executes as soon as possible. On top of that, each driver/browser works in different ways. On some drivers calling gl.flush every few 100 or 1000 WebGL calls might be a win. On others it might be a waste of time.
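As a sketch of that batching idea (drawObject and the numbers are hypothetical; whether any of this helps depends on the driver/browser):

// Roughly 5 WebGL calls per object, so flushing every 200 objects
// is about every 1000 commands.
var OBJECTS_PER_FLUSH = 200;

for (var i = 0; i < objects.length; ++i) {
  drawObject(gl, objects[i]);   // issues the uniformXXX/drawXXX calls
  if ((i + 1) % OBJECTS_PER_FLUSH === 0) {
    gl.flush();   // hint: start executing the queued commands now
  }
}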
Sorry, that was probably too much info :p
Assuming it's semantically equivalent to the classic GL glFlush then no, it will almost never be required. OpenGL is an asynchronous API — you queue up work to be done and it is done when it can be. glFlush is still asynchronous but forces any accumulated buffers to be emptied as quickly as they can be, however long that may take; it basically says to the driver "if you were planning to hold anything back for any reason, please don't".
It's usually done only for a platform-specific reason related to the coupling of OpenGL and the other display mechanisms on that system. For example, one platform might need all GL work to be flushed, not merely queued, before the container it will be drawn into can be moved to the screen. Another might allow piping of resources from a background thread into the main OpenGL context but not guarantee that they're necessarily available for subsequent calls until you've flushed (e.g. if multithreading ends up creating two separate queues where there might otherwise be one, then flush might insert a synchronisation barrier across both).
Any platform with a double buffer or with back buffering as per the WebGL model will automatically ensure that buffers proceed in a timely manner. Queueing is to aid performance but should have no negative observable consequences. So you don't have to do anything manually.
If you decline to flush in cases where you semantically perhaps should but don't strictly need to, and your graphics are predicated on real-time display anyway, then you're probably going to suffer at worst a fraction of a second of latency.
I'm trying to use a coroutine function outside of the event loop. (In this case, I want to call a function in Django that could also be used inside the event loop.)
There doesn't seem to be a way to do this without making the calling function a coroutine.
I realize that Django is built to be blocking and is therefore incompatible with asyncio. Though I think this question might help people who are making the transition or working with legacy code.
For that matter, it might help to understand async programming and why it doesn't work with blocking code.
After a lot of research I think these solutions could be helpful:
Update your legacy code to use asyncio:
Yes, I know it can be hard and painful, but it might be the sanest choice. If you want to use Django like I was... well, you've got a lot of work to do to make Django async'd. I'm not sure it is possible, but I found at least one attempt: https://github.com/aaugustin/django-c10k-demo (though in a YouTube video the author explained all of its shortcomings).
Use asyncio.async or asyncio.Task:
These items will enable you to run something async inside blocking code, but the downfall here is that you will not be able to wait for them to finish without doing something ugly like a while loop that checks whether the future has completed... ugh. But if you don't need the result, that might work for you.
About option 2: blocking code should at least be wrapped with run_in_executor, as in the sketch below.
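A minimal sketch of that, using the pre-3.5 generator-based coroutine syntax to match asyncio.async above. blocking_io is a hypothetical stand-in for legacy blocking code, and run_until_complete is also how plain synchronous code can drive a coroutine to completion:

import asyncio
import time

def blocking_io():
    # hypothetical stand-in for legacy blocking code (e.g. a Django ORM call)
    time.sleep(1)
    return "result"

@asyncio.coroutine
def main(loop):
    # run the blocking call on the default thread-pool executor so it
    # doesn't stall the event loop
    result = yield from loop.run_in_executor(None, blocking_io)
    print(result)

loop = asyncio.get_event_loop()
# synchronous code drives the coroutine to completion
loop.run_until_complete(main(loop))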
How can I find out how many times per second a particular function is called using OllyDbg? Alternatively, how can I count the total number of times the EIP has a certain value?
I don't want OllyDbg to break on executing this code.
I believe you can accomplish this by using the trace feature with conditional logging; the new OllyDbg seems to implement tracing with some better features. It's been a while since I've done this, but you can check it out. I would actually suggest using Immunity Debugger (an OllyDbg clone with a Python plugin interface) and maybe writing a Python script to do this.