Running a kernel with CUDAfy .NET using OpenCL asynchronously - visual-studio

I am trying to make asynchronous kernel calls to my GPGPU using CUDAfy .NET.
When I pass values to the kernel and copy them back to the host, I do not always get the value I expect.
I have a structure Foo with a byte Bar:
[Cudafy]
public struct Foo {
    public byte Bar;
}
And I have a kernel I want to call:
[Cudafy]
public static void simulation(GThread thread, Foo[] f)
{
    f[0].Bar = 3;
    thread.SyncThreads();
}
I use a single thread with streamID = 1 (I first noticed the issue with multiple threads, but reducing to a single thread didn't fix it).
//allocate
streamID = 1;
count = 1;
gpu.CreateStream(streamID);
Foo[] sF = new Foo[count];
IntPtr hF = gpu.HostAllocate<Foo>(count);
Foo[] dF = gpu.Allocate<Foo>(sF);
while (true)
{
    //set value
    sF[0].Bar = 1;
    byte begin = sF[0].Bar;
    //host -> pinned
    GPGPU.CopyOnHost<Foo>(sF, 0, hF, 0, count);
    sF[0].Bar = 2;
    lock (gpu)
    {
        //pinned -> device
        gpu.CopyToDeviceAsync<Foo>(hF, 0, dF, 0, count, streamID);
        //run
        gpu.Launch().simulation(dF);
        //device -> pinned
        gpu.CopyFromDeviceAsync<Foo>(dF, 0, hF, 0, count, streamID);
    }
    //WAIT
    gpu.SynchronizeStream(streamID);
    //pinned -> host
    GPGPU.CopyOnHost<Foo>(hF, 0, sF, 0, count);
    byte end = sF[0].Bar;
}
//de-allocate
gpu.Free(dF);
gpu.HostFree(hF);
gpu.DestroyStream(streamID);
First I create a stream on the GPU.
I create a regular Foo structure array of size 1 (sF) and set its Bar value to 1. Then I create pinned memory on the host (hF) for Foo as well. I also create memory on the device for Foo (dF).
I initialize the structure's Bar value to 1, then copy it to the pinned memory. (As a check, I set the structure's value to 2 after copying it to pinned memory; you'll see why later.) Then I take a lock to ensure I have exclusive access to the GPU and queue a copy to dF, a kernel launch, and a copy from dF. At this point I don't know when all of this will actually run on the GPU, so I can call SynchronizeStream to wait on the host until the device is done.
When that's done, I copy the pinned memory (hF) back to the regular host array (sF). When I read the value, it's usually a 3 (which was set on the device), but sometimes a 1 (which means either the value wasn't set in the kernel, or the new value wasn't copied back to the pinned memory). I do know that the pinned memory is copied to the structure, because the structure never ends up with the value 2.
Over many runs, a small percentage of runs results in something other than begin=1 and end=3. In those cases it is always begin=1, end=1, and it happens about 5-10% of the time.
I have no idea why this happens. I know this generally points to a race condition, but given the synchronization calls, I would expect the asynchronous calls to behave predictably.
Why would I be encountering this kind of issue with this code?
Thank you so much!
-Phil

I just figured out the issue. While the copies were being queued asynchronously on my stream, I didn't include the stream in the launch.
Changing my launch to:
gpu.Launch(gridsize, blocksize, streamID).simulation(dF);
resolved the problem. The launches were occurring on stream 0 while streams 1 and 2 were the ones being synchronized, so sometimes the data had been set by the time it was copied back and sometimes it hadn't. A race condition.

Related

Memory loads experience different latency on the same core

I am trying to implement a cache based covert channel in C but noticed something weird. The physical address between the sender and the receiver is shared by using the mmap() call that maps to the same file with the MAP_SHARED option. Below is the code for the sender process which flushes an address from the cache to transmit a 1 and loads an address into the cache to transmit a 0. It also measures the latency of a load in both cases:
// computes latency of a load operation
static inline CYCLES load_latency(volatile void* p) {
    CYCLES t1 = rdtscp();
    load = *((int *)p);
    CYCLES t2 = rdtscp();
    return (t2-t1);
}
void send_bit(int one, void *addr) {
    if(one) {
        clflush((void *)addr);
        load__latency = load_latency((void *)addr);
        printf("load latency = %d.\n", load__latency);
        clflush((void *)addr);
    }
    else {
        x = *((int *)addr);
        load__latency = load_latency((void *)addr);
        printf("load latency = %d.\n", load__latency);
    }
}
int main(int argc, char **argv) {
    if(argc == 2)
    {
        bit = atoi(argv[1]);
    }
    // transmit bit
    init_address(DEFAULT_FILE_NAME);
    send_bit(bit, address);
    return 0;
}
The load operation takes around 0 - 1000 cycles (across cache hits and cache misses) when issued by the same process.
The receiver program loads the same shared physical address and measures the latency during a cache-hit or a cache-miss, the code for which has been shown below:
int main(int argc, char **argv) {
    init_address(DEFAULT_FILE_NAME);
    rdtscp();
    load__latency = load_latency((void *)address);
    printf("load latency = %d\n", load__latency);
    return 0;
}
(I ran the receiver manually after the sender process terminated)
However, the latency observed in this scenario is very different from the first case. The load operation takes around 5000-10000 cycles.
Both processes have been pinned to the same core id using the taskset command. So if I'm not wrong, both processes should see the load latency of the L1 cache on a cache hit and of DRAM on a cache miss. Yet these two processes experience very different latencies. What could be the reason for this observation, and how can I make both processes experience the same latency?
The initial access to an mmaped region will page-fault (lazy mapping/allocation by the kernel), unless you use mmap(MAP_POPULATE), or mlock, or touch some other cache line of the page first.
You're probably timing a page fault if you only do one time measurement per mmap, or per run of a whole program.
(Also, you don't seem to be doing anything to warm up the CPU frequency, so one core cycle could be many reference cycles. Some of the time for an L3 miss is fixed in terms of memory clock cycles, but another part of it scales with the core/uncore clock.)
Also note that unless you run the 2nd process immediately (e.g. from the same shell command), the OS will get a chance to put that core into a deep sleep. On Intel CPUs at least, that empties L1d and L2 so it can power them down in the deeper C states. Probably also the TLBs.
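As a rough sketch of what pre-faulting and warming up might look like (this stands in for the question's init_address(), whose implementation isn't shown; the path, mapping length, and iteration count are placeholders, and error handling is omitted):
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
static void *map_shared(const char *path, size_t len) {
    int fd = open(path, O_RDWR);
    // MAP_POPULATE pre-faults the mapping so the first timed load
    // doesn't pay for a page fault.
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_POPULATE, fd, 0);
    close(fd);
    return p;
}
// Spin for a while so the core ramps up out of idle / low frequency
// before any latencies are measured.
static void warm_up_cpu(void) {
    for (volatile unsigned long i = 0; i < 200000000UL; i++) { }
}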
It's also strange that you cast away volatile in load = *((int *)p);
Assigning the load result to a global(?) variable inside the timed region is also pretty questionable; that could also soft page fault. If so, RDTSCP will have to wait for it, because the store can't retire.
(But on a TLB hit for the store, it doesn't have to wait for it to commit to cache, since there's no mfence before rdtscp. A store can retire while the data is still in the store buffer. In fact it must retire before the store buffer entry is known to be non-speculative so it can commit.)
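As a sketch, a timed load that keeps the volatile qualifier and returns the loaded value instead of assigning to a global inside the timed region could look like this (CYCLES and rdtscp() are the question's own definitions; the name load_latency2 is just for illustration):
static inline CYCLES load_latency2(volatile const int *p, int *out) {
    CYCLES t1 = rdtscp();
    int v = *p;    // volatile load; the value stays in a register here
    CYCLES t2 = rdtscp();
    *out = v;      // publish the loaded value outside the timed region
    return (t2 - t1);
}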

Enabling Closed-Display Mode w/o Meeting Apple's Requirements

EDIT:
I have heavily edited this question after making some significant new discoveries and the question not having any answers yet.
Historically/AFAIK, keeping your Mac awake while in closed-display mode without meeting Apple's requirements has only been possible with a kernel extension (kext) or a command run as root. Recently, however, I have discovered that there must be another way. I could really use some help figuring out how to get this working for use in a (100% free, no IAP) sandboxed Mac App Store (MAS) compatible app.
I have confirmed that some other MAS apps are able to do this, and it looks like they might be writing YES to a key named clamshellSleepDisabled. Or perhaps there's some other trickery involved that causes the key value to be set to YES? I found the function in IOPMrootDomain.cpp:
void IOPMrootDomain::setDisableClamShellSleep( bool val )
{
    if (gIOPMWorkLoop->inGate() == false) {
        gIOPMWorkLoop->runAction(
            OSMemberFunctionCast(IOWorkLoop::Action, this, &IOPMrootDomain::setDisableClamShellSleep),
            (OSObject *)this,
            (void *)val);
        return;
    }
    else {
        DLOG("setDisableClamShellSleep(%x)\n", (uint32_t) val);
        if ( clamshellSleepDisabled != val )
        {
            clamshellSleepDisabled = val;
            // If clamshellSleepDisabled is reset to 0, reevaluate if
            // system need to go to sleep due to clamshell state
            if ( !clamshellSleepDisabled && clamshellClosed)
                handlePowerNotification(kLocalEvalClamshellCommand);
        }
    }
}
I'd like to give this a try and see if that's all it takes, but I don't really have any idea how to go about calling this function. It's certainly not part of the IOPMrootDomain documentation, and I can't seem to find any helpful example code for functions that are in the IOPMrootDomain documentation, such as setAggressiveness or setPMAssertionLevel. Console also showed some evidence of what's going on behind the scenes (screenshot not reproduced here).
I've had a tiny bit of experience working with IOPMrootDomain by adapting some of ControlPlane's source for another project, but I'm at a loss for how to get started on this. Any help would be greatly appreciated. Thank you!
EDIT:
With pmdj's contribution/answer, this has been solved!
Full example project:
https://github.com/x74353/CDMManager
This ended up being surprisingly simple/straightforward:
1. Import header:
#import <IOKit/pwr_mgt/IOPMLib.h>
2. Add this function in your implementation file:
IOReturn RootDomain_SetDisableClamShellSleep (io_connect_t root_domain_connection, bool disable)
{
    uint32_t num_outputs = 0;
    uint32_t input_count = 1;
    uint64_t input[input_count];
    input[0] = (uint64_t) { disable ? 1 : 0 };
    return IOConnectCallScalarMethod(root_domain_connection, kPMSetClamshellSleepState, input, input_count, NULL, &num_outputs);
}
3. Use the following to call the above function from somewhere else in your implementation:
io_connect_t connection = IO_OBJECT_NULL;
io_service_t pmRootDomain = IOServiceGetMatchingService(kIOMasterPortDefault, IOServiceMatching("IOPMrootDomain"));
IOServiceOpen (pmRootDomain, current_task(), 0, &connection);
// 'enable' is a bool you should assign a YES or NO value to prior to making this call
RootDomain_SetDisableClamShellSleep(connection, enable);
IOServiceClose(connection);
I have no personal experience with the PM root domain, but I do have extensive experience with IOKit, so here goes:
You want IOPMrootDomain::setDisableClamShellSleep() to be called.
A code search for sites calling setDisableClamShellSleep() quickly reveals a location in RootDomainUserClient::externalMethod(), in the file iokit/Kernel/RootDomainUserClient.cpp. This is certainly promising, as externalMethod() is what gets called in response to user space programs calling the IOConnectCall*() family of functions.
Let's dig in:
IOReturn RootDomainUserClient::externalMethod(
    uint32_t selector,
    IOExternalMethodArguments * arguments,
    IOExternalMethodDispatch * dispatch __unused,
    OSObject * target __unused,
    void * reference __unused )
{
    IOReturn ret = kIOReturnBadArgument;
    switch (selector)
    {
        …
        …
        …
        case kPMSetClamshellSleepState:
            fOwner->setDisableClamShellSleep(arguments->scalarInput[0] ? true : false);
            ret = kIOReturnSuccess;
            break;
        …
So, to invoke setDisableClamShellSleep() you'll need to:
Open a user client connection to IOPMrootDomain. This looks straightforward, because:
Upon inspection, IOPMrootDomain has an IOUserClientClass property of RootDomainUserClient, so IOServiceOpen() from user space will by default create a RootDomainUserClient instance.
IOPMrootDomain does not override the newUserClient member function, so there are no access controls there.
RootDomainUserClient::initWithTask() does not appear to place any restrictions (e.g. root user, code signing) on the connecting user space process.
So it should simply be a case of running this code in your program:
io_connect_t connection = IO_OBJECT_NULL;
IOReturn ret = IOServiceOpen(
    root_domain_service,   // the IOPMrootDomain service, obtained e.g. via IOServiceGetMatchingService()
    current_task(),
    0, // user client type, ignored
    &connection);
Call the appropriate external method.
From the code excerpt earlier on, we know that the selector must be kPMSetClamshellSleepState.
arguments->scalarInput[0] being zero will call setDisableClamShellSleep(false), while a nonzero value will call setDisableClamShellSleep(true).
This amounts to:
IOReturn RootDomain_SetDisableClamShellSleep(io_connect_t root_domain_connection, bool disable)
{
    uint32_t num_outputs = 0;
    uint64_t inputs[] = { disable ? 1 : 0 };
    return IOConnectCallScalarMethod(
        root_domain_connection, kPMSetClamshellSleepState,
        inputs, 1, // 1 = length of array 'inputs'
        NULL, &num_outputs);
}
When you're done with your io_connect_t handle, don't forget to IOServiceClose() it.
This should let you toggle clamshell sleep on or off. Note that there does not appear to be any provision for automatically resetting the value to its original state, so if your program crashes or exits without cleaning up after itself, whatever state was last set will remain. This might not be great from a user experience perspective, so perhaps try to defend against it somehow, for example in a crash handler.
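For example, a minimal sketch of restoring the default on normal exit, reusing RootDomain_SetDisableClamShellSleep() from above (error handling omitted; a crash handler would need to do something similar):
#include <stdlib.h> // for atexit()
static void restore_clamshell_sleep(void)
{
    io_service_t pmRootDomain = IOServiceGetMatchingService(kIOMasterPortDefault,
                                    IOServiceMatching("IOPMrootDomain"));
    io_connect_t connection = IO_OBJECT_NULL;
    if (IOServiceOpen(pmRootDomain, current_task(), 0, &connection) == kIOReturnSuccess) {
        // Re-enable normal clamshell sleep before the process goes away.
        RootDomain_SetDisableClamShellSleep(connection, false);
        IOServiceClose(connection);
    }
}
// ...somewhere during startup:
atexit(restore_clamshell_sleep);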

Workaround for a certain IOServiceOpen() call requiring root privileges

Background
It is possible to perform a software-controlled disconnection of the power adapter of a Mac laptop by creating a DisableInflow power management assertion.
Code from this answer to an SO question can be used to create said assertion. The following is a working example that creates this assertion until the process is killed:
#include <IOKit/pwr_mgt/IOPMLib.h>
#include <unistd.h>
int main()
{
    IOPMAssertionID neverSleep = 0;
    IOPMAssertionCreateWithName(kIOPMAssertionTypeDisableInflow,
                                kIOPMAssertionLevelOn,
                                CFSTR("disable inflow"),
                                &neverSleep);
    while (1)
    {
        sleep(1);
    }
}
This runs successfully and the power adapter is disconnected by software while the process is running.
What's interesting, though, is that I was able to run this code as a regular user, without root privileges, which wasn't supposed to happen. For instance, note the comment in this file from Apple's open source repositories:
// Disables AC Power Inflow (requires root to initiate)
#define kIOPMAssertionTypeDisableInflow CFSTR("DisableInflow")
#define kIOPMInflowDisableAssertion kIOPMAssertionTypeDisableInflow
I found some code which apparently performs the actual communication with the charger; it can be found here. The following functions, from this file, appear to be of particular interest:
IOReturn
AppleSmartBatteryManagerUserClient::externalMethod(
    uint32_t selector,
    IOExternalMethodArguments * arguments,
    IOExternalMethodDispatch * dispatch __unused,
    OSObject * target __unused,
    void * reference __unused )
{
    if (selector >= kNumBattMethods) {
        // Invalid selector
        return kIOReturnBadArgument;
    }
    switch (selector)
    {
        case kSBInflowDisable:
            // 1 scalar in, 1 scalar out
            return this->secureInflowDisable((int)arguments->scalarInput[0],
                                             (int *)&arguments->scalarOutput[0]);
            break;
        // ...
    }
    // ...
}
IOReturn AppleSmartBatteryManagerUserClient::secureInflowDisable(
    int level,
    int *return_code)
{
    int admin_priv = 0;
    IOReturn ret = kIOReturnNotPrivileged;
    if( !(level == 0 || level == 1))
    {
        *return_code = kIOReturnBadArgument;
        return kIOReturnSuccess;
    }
    ret = clientHasPrivilege(fOwningTask, kIOClientPrivilegeAdministrator);
    admin_priv = (kIOReturnSuccess == ret);
    if(admin_priv && fOwner) {
        *return_code = fOwner->disableInflow( level );
        return kIOReturnSuccess;
    } else {
        *return_code = kIOReturnNotPrivileged;
        return kIOReturnSuccess;
    }
}
Note how, in secureInflowDisable(), root privileges are checked for prior to running the code. Note also this initialization code in the same file, again requiring root privileges, as explicitly pointed out in the comments:
bool AppleSmartBatteryManagerUserClient::initWithTask(task_t owningTask,
    void *security_id, UInt32 type, OSDictionary * properties)
{
    uint32_t _pid;
    /* 1. Only root processes may open a SmartBatteryManagerUserClient.
     * 2. Attempts to create exclusive UserClients will fail if an
     *    exclusive user client is attached.
     * 3. Non-exclusive clients will not be able to perform transactions
     *    while an exclusive client is attached.
     * 3a. Only battery firmware updaters should bother being exclusive.
     */
    if ( kIOReturnSuccess !=
         clientHasPrivilege(owningTask, kIOClientPrivilegeAdministrator))
    {
        return false;
    }
    // ...
}
Starting from the code in the same SO question mentioned above (the question itself, not the answer), specifically its sendSmartBatteryCommand() function, I wrote some code that calls that function, passing kSBInflowDisable as the selector (the which variable in that code).
Unlike the code using assertions, this one only works as root. If running as a regular user, IOServiceOpen() returns, weirdly enough, kIOReturnBadArgument (not kIOReturnNotPrivileged, as I would have expected). Perhaps this has to do with the initWithTask() method above.
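For reference, here is a sketch of roughly what that attempt looks like (the service class name AppleSmartBatteryManager and the kSBInflowDisable selector are taken from the linked sources, so treat them as assumptions rather than a verified recipe; this is the IOServiceOpen() call that fails for a non-root user):
io_service_t batt = IOServiceGetMatchingService(kIOMasterPortDefault,
                        IOServiceMatching("AppleSmartBatteryManager"));
io_connect_t conn = IO_OBJECT_NULL;
IOReturn ret = IOServiceOpen(batt, mach_task_self(), 0, &conn); // fails here without root
if (ret == kIOReturnSuccess) {
    uint64_t in = 1;   // level argument, 0 or 1 as checked by secureInflowDisable()
    uint64_t out = 0;
    uint32_t outCnt = 1;
    IOConnectCallScalarMethod(conn, kSBInflowDisable, &in, 1, &out, &outCnt);
    IOServiceClose(conn);
}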
The question
I need to perform a call with a different selector to this same Smart Battery Manager kext. However, I can't even get to the IOConnectCallMethod() call, since IOServiceOpen() fails, presumably because the initWithTask() method above prevents non-root users from opening the service.
The question, therefore, is this: how is IOPMAssertionCreateWithName() capable of creating a DisableInflow assertion without root privileges?
The only possibility I can think of is if there's a root-owned process to which requests are forwarded, and which performs the actual work of calling IOServiceOpen() and later IOConnectCallMethod() as root.
However, I'm hoping there's a different way of calling the Smart Battery Manager kext which doesn't require root (one that doesn't involve the IOServiceOpen() call.) Using IOPMAssertionCreateWithName() itself is not possible in my application, since I need to call a different selector within that kext, not the one that disables inflow.
It's also possible this is in fact a security vulnerability, which Apple will now fix in a future release as soon as it is alerted to this question. That would be too bad, but understandable.
Although running as root is a possibility in macOS, it's obviously desirable to avoid privilege elevation unless absolutely necessary. Also, in the future I'd like to run the same code under iOS, where it's impossible to run anything as root, in my understanding (note this is an app I'm developing for my own personal use; I understand linking to IOKit wipes out any chance of getting the app published in the App Store).

Why doesn't device_create return error when a file already exists?

I am writing a PCI driver with a character device for an interface (Linux 4.9.13). Here's the scenario that bothers me:
Run touch /dev/foo0 which creates a normal file in the /dev directory.
Load the driver module. Here's pseudo code representing what happens there (pretty standard character device registration):
// When the module is initialized:
alloc_chrdev_region(&dev, 0, 256, "foo");
class = class_create(THIS_MODULE, "foo");
// Later, when a suitable PCI device is connected the probe function
// calls the following functions:
cdev_init(dev->md_cdev, &fops);
dev->md_devnum = MKDEV(major, 0 + index);
res = cdev_add(dev->md_cdev, dev->md_devnum, 1);
dev->md_sysfsdev = device_create(class, 0, dev->md_devnum, 0, "foo%d", index);
Details:
index is just the next free index.
What seems weird to me is that nothing raises an error about there already being a /dev/foo0 file that is not a character device. I do check all the errors (I think), but I omitted the related code for conciseness. Everything works as expected if I do not run touch /dev/foo0 first. Otherwise, I can neither read from nor write to the device.
Why is it so? Shouldn't device_create return an error or at least create /dev/foo1 instead?

blk_cleanup_queue() doesn't return on block device deregistration

I'm writing a block device driver for a hot-pluggable PCI memory device on 2.6.43.2-6.fc15 (so LDD3 is out of date with respect to a lot of functions), and I'm having trouble getting the block device de-registration to go smoothly. When the device is removed, I go to tear down the gendisk and request_queue, but it hangs in blk_cleanup_queue(). Presumably there's some queue-related cleanup I have neglected to carry out before that, but I can't see any major consistent differences from other block drivers in the kernel tree that I am using for reference (memstick, cciss, etc.). What are the steps I should carry out before tidying up the queue and gendisk?
I am implementing .open, .release, .ioctl in the block_ops as well as a mydev_request(struct request_queue *q) attached with blk_init_queue(mydev_request, &mydev->lock), but I'm not sure exactly how to tidy the queue either when requests occur or when de-registering the block device.
This is caused by not ending the requests that you fetch off the queue. To fix it, end each request as follows:
while ((req = blk_fetch_request(q)) != NULL)
{
    res = mydev_submit_request_sg(mydev, req);
    // Every fetched request must be ended one way or another; otherwise
    // blk_cleanup_queue() keeps waiting for them during de-registration.
    if (res)
        __blk_end_request_all(req, res);
    else
        __blk_end_request_cur(req, res);
}
