How can one share depth images between two processes?

I have 4 different depth cameras available to me: Kinect, Xtion, PMD nano, Softkinetic DepthSense.
I have the libraries that know how to read all of them: OpenNI, PMD drivers, Softkinetic drivers.
I would ideally like to make a simple grabber for each kind of camera and then use it as a plugin in any other program, i.e. get fast, non-redundant access (without too many memory copies) to the data stream.
One of the problems is that in many cases I don't have the right library in 32-bit or 64-bit, so I can't compile all the grabbers in the same project.
What is the best way to achieve this?

I am a researcher, so this idea isn't necessarily suitable for production code, but given this scenario my best solution has been to create a server process for each type of camera. Each server process knows how to load its own type of camera stream and writes it into a shared memory space that other processes can read from.
It is obviously possible to use different kinds of locking mechanisms, but I have left the code below without any locks.
The server process will include the following:
#define BOOST_ALL_NO_LIB
#include <boost/interprocess/shared_memory_object.hpp>
#include <boost/interprocess/mapped_region.hpp>
#include <boost/interprocess/sync/scoped_lock.hpp>
#include <boost/interprocess/sync/interprocess_mutex.hpp>

using namespace std;
using namespace boost::interprocess;

struct sharedImage
{
    enum { width = 320 };
    enum { height = 240 };
    enum { dataLength = width * height * sizeof(unsigned short) }; // size in bytes
    sharedImage() {}
    interprocess_mutex mutex;
    unsigned char data[dataLength]; // raw bytes of a width x height 16-bit depth image
};
shared_memory_object shm;
sharedImage *sIm;
mapped_region region;

int setupSharedMemory()
{
    // Remove the object if it already exists
    shared_memory_object::remove("ImageMem");
    shm = shared_memory_object(create_only /*only create*/, "ImageMem" /*name*/, read_write /*read-write mode*/);
    printf("Size: %u\n", (unsigned)sizeof(sharedImage));
    // Set the size
    shm.truncate(sizeof(sharedImage));
    // Map the whole shared memory into this process
    region = mapped_region(shm, read_write);
    // Get the address of the mapped region
    void *addr = region.get_address();
    // Construct the shared structure in the preallocated memory of shm
    sIm = new (addr) sharedImage;
    return 0;
}

int shutdownSharedMemory()
{
    shared_memory_object::remove("ImageMem");
    return 0;
}
To start it up, call setupSharedMemory(); to shut down, call shutdownSharedMemory().
All the values are hard-coded in this simple example, but it's easy to imagine making it more flexible.
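To make the lifecycle concrete, here is a minimal server skeleton; it is only a sketch, and startCamera()/stopCamera() are hypothetical placeholders for whatever start/stop calls your camera SDK provides.
int main()
{
    setupSharedMemory();
    startCamera();                 // hypothetical: registers the SDK's depth callback
    while (!_kbhit()) Sleep(100);  // <conio.h>/<windows.h>: run until a key is pressed
    stopCamera();                  // hypothetical
    shutdownSharedMemory();
    return 0;
}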
Now let's assume that you are using SoftKinetic's DepthSense. You could then write the following callback for the depth node:
void onNewDepthSample(DepthNode node, DepthNode::NewSampleReceivedData data)
{
    //scoped_lock<interprocess_mutex> lock(sIm->mutex);
    memcpy(sIm->data, data.depthMap, sIm->dataLength);
}
This simply copies the latest depth map into the shared memory space.
You could also add a timestamp, a lock, and anything else you need, but this basic code works well enough for me, so I will leave it as it is.
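For example, here is a sketch of the same callback with the lock enabled and a hypothetical timestamp; the timestampMs field is not in the struct above, you would have to add it yourself.
void onNewDepthSample(DepthNode node, DepthNode::NewSampleReceivedData data)
{
    scoped_lock<interprocess_mutex> lock(sIm->mutex); // released when lock goes out of scope
    memcpy(sIm->data, data.depthMap, sharedImage::dataLength);
    //sIm->timestampMs = ...; // hypothetical extra field so consumers can detect new frames
}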
Now in some other process you can access the data in a very similar fashion.
The code below is what I use to get the live SoftKinetic DepthSense depth stream into Matlab for real-time processing. This method has a huge advantage over writing my own mex wrapper specifically for SoftKinetic, because I can use the same code for all the other cameras once I write servers for them.
#include <math.h>
#include <windows.h>
#include "mex.h"
#define BOOST_ALL_NO_LIB
#include <boost/interprocess/shared_memory_object.hpp>
#include <boost/interprocess/mapped_region.hpp>
#include <boost/interprocess/sync/scoped_lock.hpp>
#include <boost/interprocess/sync/interprocess_mutex.hpp>
#include <iostream>
#include <cstdio>
#include <cstdlib>

using namespace boost::interprocess;

// This must exactly match the layout of the struct used by the server process.
struct sharedImage
{
    enum { width = 320 };
    enum { height = 240 };
    enum { dataLength = width * height * sizeof(unsigned short) }; // size in bytes
    sharedImage() {}
    interprocess_mutex mutex;
    unsigned char data[dataLength];
};
void getFrame(unsigned short *D)
{
    // Open the existing shared memory object
    shared_memory_object shm(open_only, "ImageMem", read_write);
    // Map the whole shared memory into this process
    mapped_region region(shm, read_write);
    // Get the address of the mapped region
    void *addr = region.get_address();
    // Interpret the shared structure already constructed by the server
    sharedImage *sIm = static_cast<sharedImage*>(addr);
    //scoped_lock<interprocess_mutex> lock(sIm->mutex);
    memcpy((char*)D, (char*)sIm->data, sIm->dataLength);
}
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    // Build the output array
    mwSize dims[2] = {320, 240};
    plhs[0] = mxCreateNumericArray(2, dims, mxUINT16_CLASS, mxREAL);
    unsigned short *D = (unsigned short*)mxGetData(plhs[0]);
    try
    {
        getFrame(D);
    }
    catch (interprocess_exception &ex)
    {
        mexPrintf("getFrame:%s\n", ex.what());
    }
}
which on my computer I compile in Matlab with: mex getSKFrame.cpp -IC:\Development\boost_1_48_0
And then finally to use it in Matlab: D = getSKFrame()'; imagesc(D)

Related

How to optimize SYCL kernel

I'm studying SYCL at university and I have a question about the performance of some code.
In particular, I need to translate a plain C/C++ loop into a SYCL kernel with parallelization, and this is what I have:
#include <sycl/sycl.hpp>
#include <vector>
#include <iostream>

using namespace sycl;

constexpr int size = 131072; // 2^17

int main(int argc, char** argv) {
    // Create a vector with `size` elements and initialize them to 1
    std::vector<float> dA(size, 1.0f);
    try {
        queue gpuQueue{ gpu_selector{} };
        buffer<float, 1> bufA(dA.data(), range<1>(dA.size()));
        gpuQueue.submit([&](handler& cgh) {
            accessor inA{ bufA, cgh };
            cgh.parallel_for(range<1>(size),
                [=](id<1> i) { inA[i] = inA[i] + 2; });
        });
        gpuQueue.wait_and_throw();
    }
    catch (std::exception& e) { throw e; }
    return 0;
}
So my question is about the value c: in this case I use the literal 2 directly, but will this impact performance when I run the code? Do I need to create a variable, or is it correct this way with good performance?
Thanks in advance for the help!
Interesting question. In this case the value 2 will be a literal in the instruction in your SYCL kernel - this is as efficient as it gets, I think! There's the slight complication that you have an implicit cast from int to float. My guess is that you'll probably end up with a float literal 2.0 in your device assembly. Your SYCL device won't have to fetch that 2 from memory or cast it at runtime or anything like that; it just lives in the instruction.
Equally, if you had:
constexpr int c = 2;
// the rest of your code
[=](id<1> i) { inA[i] = inA[i] + c; }
// etc
The compiler is almost certainly smart enough to propagate the constant value of c into the kernel code. So, again, the 2.0 literal ends up in the instruction.
I compiled your example with DPC++ and extracted the LLVM IR, and found the following lines:
%5 = load float, float addrspace(4)* %arrayidx.ascast.i.i, align 4, !tbaa !17
%add.i = fadd float %5, 2.000000e+00
store float %add.i, float addrspace(4)* %arrayidx.ascast.i.i, align 4, !tbaa !17
This shows a float load and store to/from the same address, with an 'add 2.0' instruction in between. If I modify the code to use the variable c as demonstrated above, I get the same LLVM IR.
Conclusion: you've already achieved maximum efficiency, and compilers are smart!

Number of mapped views to a shared memory on Windows

Is there a way to check how many views have been mapped on to a memory mapped file on Windows?
Something like the equivalent of shmctl(... ,IPC_STAT,...) on Linux?
I had the same need to access the number of shared views. So I created this question: Accessing the number of shared memory mapped file views (Windows)
You may find a solution that suits your needs there.
As per Scath's comment, I am going to add the proposed solution here, although the merit should go to eryksun and RbMm. Using the NtQueryObject call, one can access the HandleCount (although it may not be 100% reliable):
#include <stdio.h>
#include <stdlib.h>
#include <windows.h>
#include <winternl.h>

typedef NTSTATUS (__stdcall *NtQueryObjectFuncPointer) (
    HANDLE Handle,
    OBJECT_INFORMATION_CLASS ObjectInformationClass,
    PVOID ObjectInformation,
    ULONG ObjectInformationLength,
    PULONG ReturnLength);

int main(void)
{
    PUBLIC_OBJECT_BASIC_INFORMATION pobi;
    ULONG rLen;
    // Create the memory mapped file (in the system pagefile). The global namespace
    // would be better, but it needs the SeCreateGlobalPrivilege privilege.
    HANDLE hMap = CreateFileMapping(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE|SEC_COMMIT,
        0, 1, "Local\\UniqueShareName");
    // Get the NtQueryObject function pointer and then the handle basic information
    NtQueryObjectFuncPointer _NtQueryObject = (NtQueryObjectFuncPointer)GetProcAddress(
        GetModuleHandle("ntdll.dll"), "NtQueryObject");
    _NtQueryObject(hMap, ObjectBasicInformation, (PVOID)&pobi, (ULONG)sizeof(pobi), &rLen);
    // Check the limit
    if (pobi.HandleCount > 4) {
        printf("Limit exceeded: %lu > 4\n", pobi.HandleCount);
        exit(1);
    }
    //...
    Sleep(30000);
    return 0;
}
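As a quick sanity check (a sketch, under the same caveat that HandleCount may not be fully reliable), every additional handle to the mapping, for example one opened with OpenFileMapping, should raise the reported count:
// Sketch: open a second handle to the same mapping; querying again via
// _NtQueryObject should now report a higher HandleCount.
HANDLE hMap2 = OpenFileMapping(FILE_MAP_READ, FALSE, "Local\\UniqueShareName");
_NtQueryObject(hMap, ObjectBasicInformation, (PVOID)&pobi, (ULONG)sizeof(pobi), &rLen);
printf("HandleCount is now: %lu\n", pobi.HandleCount);
CloseHandle(hMap2);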

Accessing SuperBlock object of linux kernel in a system call

I am trying to access the super block object, which is defined in linux/fs.h.
But how do I initialize the object so that I can access its properties?
I found that alloc_super() is used to initialize the super block, but how is it called?
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <errno.h>
#include <linux/fs.h>

int main(){
    printf("hello there");
    struct super_block *sb;
    return 0;
}
The answer is very much file-system dependent, since different file systems have different superblock layouts and, in fact, different arrangements of blocks.
For instance, an ext2 file system's superblock is at a known location on disk (byte 1024) and has a known size (sizeof(struct superblock) bytes).
So a typical implementation (this is not working code, but it can be made to work with minor modifications) of what you want would be:
struct superblock *read_superblock(int fd) {
    struct superblock *sb = malloc(sizeof(struct superblock));
    assert(sb != NULL);
    lseek(fd, (off_t) 1024, SEEK_SET);
    read(fd, (void *) sb, sizeof(struct superblock));
    return sb;
}
Now, you can allocate the superblock using the Linux headers, or write your own struct that exactly matches the superblock of the ext2/ext3/etc. file system.
Then you must know where to find the superblock on disk (that is what the lseek() is for).
You also need to pass the disk's file descriptor to the function.
So do:
int fd = open(argv[1], O_RDONLY);
struct superblock * sb = read_superblock(fd);
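For illustration, here is a sketch of the first fields of the ext2 on-disk superblock; the names follow the classic ext2 format, but verify the layout against your kernel's ext2_fs.h before relying on it. The magic field at byte offset 56 should read 0xEF53 for a valid superblock.
#include <stdint.h>

/* Sketch: prefix of the ext2 on-disk superblock (little-endian on disk).
   Only the first fields are shown; see ext2_fs.h for the full structure. */
struct ext2_superblock_prefix {
    uint32_t s_inodes_count;      /* total number of inodes */
    uint32_t s_blocks_count;      /* total number of blocks */
    uint32_t s_r_blocks_count;    /* blocks reserved for root */
    uint32_t s_free_blocks_count;
    uint32_t s_free_inodes_count;
    uint32_t s_first_data_block;
    uint32_t s_log_block_size;    /* block size = 1024 << s_log_block_size */
    uint32_t s_log_frag_size;
    uint32_t s_blocks_per_group;
    uint32_t s_frags_per_group;
    uint32_t s_inodes_per_group;
    uint32_t s_mtime;
    uint32_t s_wtime;
    uint16_t s_mnt_count;
    uint16_t s_max_mnt_count;
    uint16_t s_magic;             /* 0xEF53, at byte offset 56 */
};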

how to create multiple containers in boost shared memory?

Constructing multiple objects in shared memory is possible as shown in this example:
#include <boost/interprocess/managed_shared_memory.hpp>
#include <functional>
#include <iostream>

using namespace boost::interprocess;

void construct_objects(managed_shared_memory &managed_shm)
{
    managed_shm.construct<int>("Integer")(99);
    managed_shm.construct<float>("Float")(3.14);
}

int main()
{
    shared_memory_object::remove("Boost");
    managed_shared_memory managed_shm{open_or_create, "Boost", 1024};
    auto atomic_construct = std::bind(construct_objects, std::ref(managed_shm));
    managed_shm.atomic_func(atomic_construct);
    std::cout << *managed_shm.find<int>("Integer").first << '\n';
    std::cout << *managed_shm.find<float>("Float").first << '\n';
}
But when I try to create two vectors, or a vector and a list, I run into problems with the memory allocation. Is there a way to create multiple containers in a single shared memory segment in Boost?
I had a look at managed_memory_impl.hpp, but it wasn't much help either.
This is my code (you have to link it with libpthread and librt):
#include <boost/interprocess/mapped_region.hpp>
#include <boost/interprocess/sync/interprocess_semaphore.hpp>
#include <boost/interprocess/containers/vector.hpp>
#include <boost/interprocess/containers/list.hpp>
#include <boost/interprocess/managed_shared_memory.hpp>
#include <cstdlib> //std::system
#include <cstddef>
#include <cassert>
#include <utility>
#include <iostream>

// Define an STL compatible allocator of ints that allocates from the managed_shared_memory.
// This allocator will allow placing containers in the segment.
typedef boost::interprocess::allocator<int, boost::interprocess::managed_shared_memory::segment_manager> ShmemAllocator;
// Alias a vector that uses the previous STL-like allocator so that it allocates its values from the segment.
typedef boost::interprocess::vector<int, ShmemAllocator> MyVector;
typedef boost::interprocess::allocator<int, boost::interprocess::managed_shared_memory::segment_manager> ShmemListAllocator;
typedef boost::interprocess::list<int, ShmemListAllocator> MyList;

int main(int argc, char *argv[])
{
    //Construct managed shared memory
    boost::interprocess::managed_shared_memory segment(boost::interprocess::create_only, "MySharedMemory", 65536);
    //const ShmemAllocator alloc_inst(segment.get_segment_manager());
    MyVector *instance  = segment.construct<MyVector>("MyType instance")(segment.get_segment_manager());
    MyVector *instance2 = segment.construct<MyVector>("MyType instance")(segment.get_segment_manager());
    MyList   *instance3 = segment.construct<MyList>("MyList instance")(segment.get_segment_manager());
    return 0;
}//main
You should either use unique names, or you can use the indexed ("array") style of construction.
See the documentation for the Object construction function family:
//!Allocates and constructs an array of objects of type MyType (throwing version)
//!Each object receives the same parameters (par1, par2, ...)
MyType *ptr = managed_memory_segment.construct<MyType>("Name")[count](par1, par2...);
and
//!Tries to find a previously created object. If not present, allocates and
//!constructs an array of objects of type MyType (throwing version). Each object
//!receives the same parameters (par1, par2, ...)
MyType *ptr = managed_memory_segment.find_or_construct<MyType>("Name")[count](par1, par2...);
and
//!Allocates and constructs an array of objects of type MyType (throwing version)
//!Each object receives parameters returned with the expression (*it1++, *it2++,... )
MyType *ptr = managed_memory_segment.construct_it<MyType>("Name")[count](it1, it2...);
and possibly some more. Look for [count].
(I recommend using unique names for simplicity)
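That said, here is a minimal sketch of the indexed style; the name "MyVectors" is just for illustration:
// Construct an array of two vectors under a single name; both receive the
// same allocator argument. find<MyVector>("MyVectors") would return the
// pointer together with the count (2).
MyVector *vecs = segment.construct<MyVector>("MyVectors")[2](segment.get_segment_manager());
// vecs[0] and vecs[1] are two distinct shared-memory vectors.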
Update
To the comments, here's what I meant by "unique name". I've tested it, and it works fine:
Live On Coliru¹
#include <boost/interprocess/containers/vector.hpp>
#include <boost/interprocess/containers/list.hpp>
#include <boost/interprocess/managed_shared_memory.hpp>
#include <cassert>
#include <cstdio>     // std::remove
#include <functional> // std::equal_to

// Define an STL compatible allocator of ints that allocates from the managed_shared_memory.
// This allocator will allow placing containers in the segment.
typedef boost::interprocess::allocator<int, boost::interprocess::managed_shared_memory::segment_manager> ShmemAllocator;
// Alias a vector that uses the previous STL-like allocator so that it allocates its values from the segment.
typedef boost::interprocess::vector<int, ShmemAllocator> MyVector;
typedef boost::interprocess::allocator<int, boost::interprocess::managed_shared_memory::segment_manager> ShmemListAllocator;
typedef boost::interprocess::list<int, ShmemListAllocator> MyList;

int main()
{
    // Construct managed shared memory
    std::remove("/dev/shm/MySharedMemory");
    boost::interprocess::managed_shared_memory segment(boost::interprocess::create_only, "MySharedMemory", 65536);

    MyVector *instance  = segment.construct<MyVector>("MyType instance 1")(segment.get_segment_manager());
    MyVector *instance2 = segment.construct<MyVector>("MyType instance 2")(segment.get_segment_manager());
    MyList   *instance3 = segment.construct<MyList>("MyList instance")(segment.get_segment_manager());

    assert(instance);
    assert(instance2);
    assert(instance3);

    assert(!std::equal_to<void*>()(instance, instance2));
    assert(!std::equal_to<void*>()(instance, instance3));
    assert(!std::equal_to<void*>()(instance2, instance3));
}
¹ Of course, SHM is not supported on Coliru. However, here is an identical sample using a mapped file: Live On Coliru

Can someone help me replace "lock_kernel" on a block device driver?

Thank you for looking at this post. I am trying to patch up a network block device driver. If you need to see the sources, they are at http://code.ximeta.com.
I noticed that lock_kernel() seems deprecated as of Linux 2.6.37. I read "The new way of ioctl()" and found that device drivers should now perform a specific lock before operating.
So I would like some advice replacing this if possible.
I have found two sections in the current code that I think are related, both in the block folder: io.c and ctrldev.c.
I put snippets from each for your consideration.
io.c contains one call to lock_kernel:
NDAS_SAL_API xbool sal_file_get_size(sal_file file, xuint64* size)
{
    /* definitions and declarations etc. */
    lock_kernel();
#ifdef HAVE_UNLOCKED_IOCTL
    if (filp->f_op->unlocked_ioctl) {
        /* some small statements */
        error = filp->f_op->unlocked_ioctl(filp, BLKGETSIZE64, (unsigned long)size);
        /* actions if error or not etc. */
    }
#endif
    unlock_kernel();
    return ret;
}
And ctrldev.c contains the main io function:
#include <linux/spinlock.h>   // spinlock_t
#include <linux/semaphore.h>  // struct semaphore
#include <asm/atomic.h>       // atomic
#include <linux/interrupt.h>
#include <linux/fs.h>
#include <asm/uaccess.h>
#include <linux/ide.h>
#include <linux/smp_lock.h>
#include <linux/time.h>
......
int ndas_ctrldev_ioctl(struct inode *inode, struct file *filp, unsigned int cmd, unsigned long arg)
{
    /* lots of operations and functions */
    return result;
}
Later, the ndas_ctrldev_ioctl function is installed as the .ioctl handler:
static struct file_operations ndasctrl_fops = {
    .write   = ndas_ctrldev_write,
    .read    = ndas_ctrldev_read,
    .open    = ndas_ctrldev_open,
    .release = ndas_ctrldev_release,
    .ioctl   = ndas_ctrldev_ioctl,
};
Now I want to convert this to avoid using lock_kernel().
According to my understanding, I would modify the former sections as below:
NDAS_SAL_API xbool sal_file_get_size(sal_file file, xuint64* size)
{
    /* definitions and declarations etc. */
#ifndef HAVE_UNLOCKED_IOCTL
    lock_kernel();
#endif
#ifdef HAVE_UNLOCKED_IOCTL
    if (filp->f_op->unlocked_ioctl) {
        /* some small statements */
        error = filp->f_op->unlocked_ioctl(filp, BLKGETSIZE64, (unsigned long)size);
        /* actions if error or not etc. */
    }
#endif
#ifndef HAVE_UNLOCKED_IOCTL
    unlock_kernel();
#endif
    return ret;
}
#ifdef HAVE_UNLOCKED_IOCTL
long ndas_ctrldev_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
#else
int ndas_ctrldev_ioctl(struct inode *inode, struct file *filp, unsigned int cmd, unsigned long arg)
#endif
{
#ifdef HAVE_UNLOCKED_IOCTL
    /* ! add some sort of lock here ! */
#endif
    /* lots of operations and functions */
#ifdef HAVE_UNLOCKED_IOCTL
    /* ! add unlock statement here ! */
#endif
    return result;
}
static struct file_operations ndasctrl_fops = {
    .write   = ndas_ctrldev_write,
    .read    = ndas_ctrldev_read,
    .open    = ndas_ctrldev_open,
    .release = ndas_ctrldev_release,
#ifdef HAVE_UNLOCKED_IOCTL
    .unlocked_ioctl = ndas_ctrldev_ioctl,
#else
    .ioctl   = ndas_ctrldev_ioctl,
#endif
};
So, I would ask the following advice.
1. Does this look like the right procedure?
2. Do I understand correctly that the lock should move into the io function?
3. Based on the includes in ctrldev.c, can you recommend any lock off the top of your head? (I tried to research some other drivers dealing with filp and lock_kernel, but I am too much of a noob to find the answer right away.)
The Big Kernel Lock (BKL) is more than deprecated - as of 2.6.39, it does not exist anymore.
The way the lock_kernel() conversion was done was to replace it by per-driver mutexes. If the driver is simple enough, you can simply create a mutex for the driver, and replace all uses of lock_kernel() and unlock_kernel() by the mutex lock/unlock calls. Note, however, that some functions used to be called with the BKL (the lock lock_kernel() used to lock) held; you will have to add lock/unlock calls to these functions too.
This will not work if the driver could acquire the BKL recursively; if that is the case, you would have to track it yourself to avoid deadlocks (this was done in the conversion of reiserfs, which depended somewhat heavily both in the recursive BKL behavior and in the fact that it was dropped when sleeping).
The next step after the conversion to a per-driver mutex would be to change it to use a per-device mutex instead of a per-driver mutex.
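As a sketch of that first conversion step (the handler and mutex names here are illustrative, not from the NDAS sources):
#include <linux/mutex.h>

static DEFINE_MUTEX(ndas_driver_mutex);

static long ndas_ctrldev_unlocked_ioctl(struct file *filp,
                                        unsigned int cmd, unsigned long arg)
{
    long result;

    mutex_lock(&ndas_driver_mutex);    /* was: lock_kernel(); */
    /* ... existing ioctl body ... */
    result = 0;                        /* placeholder */
    mutex_unlock(&ndas_driver_mutex);  /* was: unlock_kernel(); */
    return result;
}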
Here is the solution.
#if HAVE_UNLOCKED_IOCTL
#include <linux/mutex.h>
#else
#include <linux/smp_lock.h>
#endif
.
.
.
#if HAVE_UNLOCKED_IOCTL
mutex_lock(&fs_mutex);
#else
lock_kernel();
#endif
This only shows replacing the lock call. The other parts worked out as I guessed above in the question, concerning unlocked_ioctl. Thanks for checking and for helping.
