macOS shm - unable to get the true data size in shm

While doing shm-related development on macOS, I arrived at the approach shown in the code below (which I have verified works).
However, there is a new problem I cannot solve: when ftruncate() resizes the shared memory object behind shm_fd, the allocation is rounded up to a multiple of the page size.
As a result, when another process opens the shared memory file, it cannot obtain the actual data size: the file size it sees is an integer multiple of the page size, which causes an error when appending data.
// write: data_size = 12
char *data = "....";
long data_size = 12;
shmFD = shm_open(...);
ftruncate(shmFD, data_size); // the size actually allocated is not 12 but 4096 (one page)
shmAddr = (char *)mmap(NULL, data_size, ... , shmFD, 0);
memcpy(shmAddr, data, data_size);

// read
...
fstat(shmFD, &sb);
long context_len_in_shm = sb.st_size;
// wrong shm size obtained -> context_len_in_shm = 4096

For now I use the following layout to record data in the shm. The first step before any write or read is to fetch the data_len field, and from it determine how much data follows. I hope for a more concise way, like using lseek() on a file under Linux.
shm mem map:
----shm mem----
struct {
    long data_len;
    data[1];
    data[2];
    ...
    data[data_len];
}
---------------
long *shm_mem = (long *)shmAddr;
long data_size = shm_mem[0]; // before reading, check that the shm is non-empty and the pointer is valid (omitted here)
char *shm_data = (char *)&(shm_mem[1]);
char *buffer = (char *)malloc(data_size);
memcpy(buffer, shm_data, data_size);
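
For completeness, here is a minimal, self-contained sketch of that length-prefix approach as a write/read pair. The header struct shm_hdr and the object name /myshm are made up for the example, and error handling is reduced to asserts:

#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

struct shm_hdr { long data_len; }; /* hypothetical header, matching the layout above */

int main(void)
{
    const char *msg = "hello, shm!";
    long len = (long)strlen(msg);
    size_t total = sizeof(struct shm_hdr) + (size_t)len;

    /* writer */
    int fd = shm_open("/myshm", O_CREAT | O_RDWR, 0644);
    assert(fd >= 0);
    assert(ftruncate(fd, (off_t)total) == 0); /* macOS rounds the object up to a page */
    char *p = mmap(NULL, total, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    assert(p != MAP_FAILED);
    ((struct shm_hdr *)p)->data_len = len; /* record the true payload size */
    memcpy(p + sizeof(struct shm_hdr), msg, (size_t)len);

    /* reader: trust the header, not st_size */
    long n = ((struct shm_hdr *)p)->data_len;
    printf("%.*s\n", (int)n, p + sizeof(struct shm_hdr));

    munmap(p, total);
    close(fd);
    shm_unlink("/myshm");
    return 0;
}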

Related

How to use GDCM to write voxel data, slice by slice?

In all the examples I've seen for GDCM on how to write image data, the image volume is always treated as a single, cohesive buffer. The basic structure is along the lines of:
#include "gdcmImage.h"
#include "gdcmImageWriter.h"
#include "gdcmFileDerivation.h"
#include "gdcmUIDGenerator.h"
int write_image(...)
{
size_t width = ..., height = ..., depth = ...;
auto im = new gdcm::Image;
std::vector<...> buffer;
auto p = buffer.data();
im->SetNumberOfDimensions(3);
im->SetDimension(0, width);
im->SetDimension(1, height);
im->SetDimension(1, depth);
im->GetPixelFormat().SetSamplesPerPixel(...);
im->SetPhotometricInterpretation( gdcm::PhotometricInterpretation::... );
unsigned long l = im->GetBufferLength();
if( l != width * height * depth * sizeof(...) ){ return SOME_ERROR; }
gdcm::DataElement pixeldata( gdcm::Tag(0x7fe0,0x0010) );
pixeldata.SetByteValue( buffer.data(), buffer.size()*sizeof(*buffer.data()) );
im->SetDataElement( pixeldata );
gdcm::UIDGenerator uid;
auto file = new gdcm::File;
gdcm::FileDerivation fd;
const char UID[] = ...;
fd.AddReference( ReferencedSOPClassUID, uid.Generate() );
fd.SetFile( *file );
// If all Code Value are ok the filter will execute properly
if( !fd.Derive() ){ return SOME_ERROR; }
gdcm::ImageWriter w;
w.SetImage( *im );
w.SetFile( fd.GetFile() );
// Set the filename:
w.SetFileName( "some_image.dcm" );
if( !w.Write() ){ return SOME_ERROR; }
return 0;
}
The problem I'm facing with this approach is that the amount of image data I need to store easily exceeds the available system memory if an additional copy is made; specifically, these are volumes of 4096×4096×2048 voxels of 12 bits each, so about 48 GiB of data in memory.
However, the approach of using gdcm::DataElement and gdcm::Image::SetDataElement will obviously create a full copy of the data in buffer, which is troublesome. For one, the data as produced by my imaging system does not reside in memory as a cohesive, singular block of values; it is split into slices. And the total amount of data fits into the memory of the systems being used only once.
It is trivial for me to read in the data slice by slice, which would cut down the memory requirements significantly. However, I'm at a loss as to how that would be done with GDCM.
Did you check gdcm::FileStreamer?
http://gdcm.sourceforge.net/3.0/html/classgdcm_1_1FileStreamer.xhtml
See a typical setup at:
https://github.com/malaterre/GDCM/blob/master/Examples/Csharp/FileStreaming.cs
The example shows how to create an out-of-memory private element, but you can do the same with a public DataElement.
A more complex example, reading where the Pixel Data is written in chunks, is at:
https://github.com/malaterre/GDCM/blob/master/Examples/Csharp/FileChangeTS.cs#L126-L154
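
In C++ the same idea might look roughly like this. This is an untested sketch: it assumes the public-element calls StartDataElement / AppendToDataElement / StopDataElement behave like the Group* calls in the linked C# example, and that template_dcm names a pre-built dataset carrying all the non-pixel attributes:

#include "gdcmFileStreamer.h"
#include "gdcmTag.h"
#include <vector>

bool stream_slices(const char *template_dcm, const char *out_dcm,
                   size_t slice_bytes, size_t num_slices)
{
    gdcm::FileStreamer fs;
    fs.SetTemplateFileName(template_dcm); // dataset without Pixel Data
    fs.SetOutputFileName(out_dcm);

    const gdcm::Tag pixeldata(0x7fe0, 0x0010);
    if (!fs.StartDataElement(pixeldata)) return false;

    std::vector<char> slice(slice_bytes);
    for (size_t i = 0; i < num_slices; ++i) {
        // fill `slice` from the imaging system here, one slice at a time,
        // so only a single slice is ever resident in memory
        if (!fs.AppendToDataElement(pixeldata, slice.data(), slice.size()))
            return false;
    }
    return fs.StopDataElement(pixeldata);
}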

How to replace read function for procfs entry that returned EOF and byte count read both?

I am working on updating our kernel drivers to work with Linux kernel 4.4.0 on Ubuntu 16.04. The drivers last worked with Linux kernel 3.9.2.
In one of the modules, we have procfs entries created to read/write the on-board fan monitoring values. Fan monitoring is used to read/write the CPU or GPU temperature/modulation, etc. values.
The module is using the following api to create procfs entries:
struct proc_dir_entry *create_proc_entry(const char *name, umode_t mode,
                                         struct proc_dir_entry *parent);
Something like:
struct proc_dir_entry *proc_entry =
    create_proc_entry("fmon_gpu_temp", 0644, proc_dir);
proc_entry->read_proc = read_proc;
proc_entry->write_proc = write_proc;
Now, the read_proc is implemented something like this:
static int read_value(char *buf, char **start, off_t offset, int count, int *eof, void *data)
{
    int len = 0;
    int idx = (int)data;

    if (idx == TEMP_FANCTL)
        len = sprintf(buf, "%d.%02d\n", fmon_readings[idx] / TEMP_SAMPLES,
                      fmon_readings[idx] % TEMP_SAMPLES * 100 / TEMP_SAMPLES);
    else if (idx == TEMP_CPU) {
        int i;
        len = sprintf(buf, "%d", fmon_readings[idx]);
        for (i = 0; i < FCTL_MAX_CPUS && fmon_cpu_temps[i]; i++) {
            len += sprintf(buf + len, " CPU%d=%d", i, fmon_cpu_temps[i]);
        }
        len += sprintf(buf + len, "\n");
    }
    else if (idx >= 0 && idx < READINGS_MAX)
        len = sprintf(buf, "%d\n", fmon_readings[idx]);

    *eof = 1;
    return len;
}
This read function assumes that the user has provided enough buffer space to store the temperature value; that is handled correctly in the userspace program. Also, every call to this function returns the value in its entirety, so there is no support for (or need of) subsequent reads of the same temperature value.
Plus, if I use the "cat" program on this procfs entry from the shell, 'cat' displays the value correctly. This works, I think, because EOF is set to true and the count of bytes read is returned.
Newer Linux kernels no longer support this API.
My question is:
How can I port this to the new procfs API while keeping the functionality the same: every read should return the value, and 'cat' should keep working rather than going into an infinite loop?
The primary user interface for reading files on Linux is read(2). Its counterpart in kernel space is the .read function in struct file_operations.
Every other mechanism for reading files in kernel space (read_proc, seq_file, etc.) is actually a (parametrized) implementation of the .read function.
The only way for the kernel to return an EOF indicator to user space is to return 0 as the number of bytes read.
Even the read_proc implementation you have for the 3.9 kernel actually implements the eof flag by returning 0 on the next invocation. And cat actually performs a second invocation of read() to find that the file has ended.
(Moreover, cat performs more than two invocations of read(): the first with a count of 1, the second with a count equal to the page size minus 1, and the last with the remaining count.)
The simplest way to implement a "one-shot" read is to use seq_file in single_open() mode.
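
A minimal sketch of what the port to the 4.4 procfs API could look like, reusing names from the question (fmon_readings, READINGS_MAX, TEMP_CPU, proc_dir, and the integer index cookie are assumed to exist as in the original module):

#include <linux/module.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>

/* show() runs once per open(); after its buffer is drained, seq_read()
 * returns 0 (EOF), so `cat` terminates cleanly instead of looping. */
static int fmon_show(struct seq_file *m, void *v)
{
    int idx = (long)m->private; /* the old (void *)data cookie */

    if (idx >= 0 && idx < READINGS_MAX)
        seq_printf(m, "%d\n", fmon_readings[idx]);
    return 0;
}

static int fmon_open(struct inode *inode, struct file *file)
{
    return single_open(file, fmon_show, PDE_DATA(inode));
}

static const struct file_operations fmon_fops = {
    .owner   = THIS_MODULE,
    .open    = fmon_open,
    .read    = seq_read,
    .llseek  = seq_lseek,
    .release = single_release,
};

/* replaces create_proc_entry(); the data cookie is the last argument */
proc_entry = proc_create_data("fmon_gpu_temp", 0644, proc_dir,
                              &fmon_fops, (void *)TEMP_CPU);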

cudaMemcpy() gives segfault when using Type**

I want to copy a double-pointer object to the device and compute over it on the GPU. Doing a cudaMemcpy of the object to the device throws a SEGFAULT.
BMP Input;
Input.ReadFromFile( fileName );
WIDTH  = Input.TellWidth();
HEIGHT = Input.TellHeight();

RGBApixel** imageData = new RGBApixel* [HEIGHT];
for (int i = 0; i < HEIGHT; i++)
    imageData[i] = new RGBApixel [WIDTH];

for (int j = 0; j < Input.TellHeight(); j++) {
    for (int i = 0; i < Input.TellWidth(); i++) {
        imageData[j][i] = Input.GetPixel(i, j);
    }
}

long long imageSize = WIDTH*HEIGHT*sizeof(RGBApixel *);
RGBApixel **dev_imgdata, **dev_imgdata_out;

// Allocating CUDA memory
cudaMalloc( (void **) &dev_imgdata, imageSize );
cudaMalloc( (void **) &dev_imgdata_out, imageSize );
Now, the line below throws the segfault:
cudaMemcpy(dev_imgdata,imageData,imageSize,cudaMemcpyHostToDevice);
When you declare RGBApixel** imageData = new RGBApixel* [HEIGHT]; you have absolutely no guarantee that imageData will occupy a contiguous block of memory.
cudaMemcpy copies contiguous blocks of memory into device RAM. Your statement tries to copy the start addresses of each matrix row, not the actual data. Also, when using cudaMalloc, you would need to allocate each row properly, exactly as you did for the host buffer.
What you need to do is declare imageData as just an RGBApixel* - essentially, put the matrix in a single vector and use proper indexing - and it will work.
You could also copy one row at a time, but that is not good practice: every memory access would require an extra indirection, and you would ruin caching efficiency.
Also, make sure that when you compile your program you use -arch sm_20 to enable the extra features of your graphics card (if it has compute capability 2.0). Without it, I believe you can't use double, and the results are unpredictable (or the doubles are demoted to floats).
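
For illustration, a sketch of the flattened layout, reusing WIDTH, HEIGHT, Input, and the EasyBMP RGBApixel type from the question; pixel (x, y) lives at index y * WIDTH + x, so a single contiguous cudaMemcpy suffices:

#include <cuda_runtime.h>
#include <vector>

// Flatten the image into one contiguous host buffer, row by row.
std::vector<RGBApixel> host(WIDTH * HEIGHT);
for (int y = 0; y < HEIGHT; ++y)
    for (int x = 0; x < WIDTH; ++x)
        host[y * WIDTH + x] = Input.GetPixel(x, y);

// Size of the pixel data itself, not of an array of row pointers.
size_t imageBytes = size_t(WIDTH) * HEIGHT * sizeof(RGBApixel);

RGBApixel *dev_imgdata = NULL;
cudaMalloc((void **)&dev_imgdata, imageBytes);

// One contiguous copy - no per-row pointers involved.
cudaMemcpy(dev_imgdata, host.data(), imageBytes, cudaMemcpyHostToDevice);

// Inside a kernel, pixel (x, y) is dev_imgdata[y * WIDTH + x].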

I/O to device from kernel module fails with EFAULT

I have created a block device in a kernel module. When some I/O happens, I read/write all data from/to another existing device (let's say /dev/sdb).
It opens OK, but read/write operations return error 14 (EFAULT, Bad Address). After some research I found that I may need to map an address to user space (probably the buffer or filp variables), but copy_to_user does not help. I also looked at the mmap() and remap_pfn_range() functions, but I cannot work out how to use them in my code, especially where to get the correct vm_area_struct structure. All the examples I found used char devices and the file_operations structure, not block devices.
Any hints? Thanks for any help.
Here is my code for reading:
mm_segment_t old_fs;
old_fs = get_fs();
set_fs(KERNEL_DS);

filp = filp_open("/dev/sdb", O_RDONLY | O_DIRECT | O_SYNC, 00644);
if (IS_ERR(filp))
{
    int err = PTR_ERR(filp);
    set_fs(old_fs);
    printk(KERN_ALERT "Cannot open file - %d", err);
    return;
}
else
{
    bytesRead = vfs_read(filp, buffer, nbytes, &offset); // this gives error 14
    filp_close(filp, NULL);
}
set_fs(old_fs);
I found a better way to do I/O to a block device from a kernel module: the bio structure. I hope this information saves somebody from a headache.
1) If you want to redirect I/O from your block device to an existing block device, you have to use your own make_request function. For that, create the queue for your block device with the blk_alloc_queue function, like this:
device->queue = blk_alloc_queue(GFP_KERNEL);
blk_queue_make_request(device->queue, own_make_request);
Then, inside your own_make_request function, change the bi_bdev member of the bio structure to the device you are redirecting to, and call generic_make_request:
bio->bi_bdev = device_in_which_redirect;
generic_make_request(bio);
More information is here, in chapter 16. If the link is ever broken, here is the name of the book: "Linux Device Drivers, Third Edition".
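
For illustration, the redirecting make_request function could be fleshed out roughly like this (a sketch assuming the same 3.x-era kernel API as the snippets here, where make_request_fn returns void and the bio still has a bi_bdev member):

/* registered with blk_queue_make_request() above */
static void own_make_request(struct request_queue *q, struct bio *bio)
{
    bio->bi_bdev = device_in_which_redirect; /* the target block device */
    generic_make_request(bio);               /* hand the bio to its queue */
}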
2) If you want to read or write your own data to an existing block device from a kernel module, you should use the submit_bio function.
Code for writing to a specific sector (you also need to implement the writeComplete function; a sketch of both completion callbacks follows the read example below):
void writePage(struct block_device *device,
               sector_t sector, int size, struct page *page)
{
    struct bio *bio = bio_alloc(GFP_NOIO, 1);

    bio->bi_bdev = device; /* was vnode->blkDevice; use the function's own argument */
    bio->bi_sector = sector;
    bio_add_page(bio, page, size, 0);
    bio->bi_end_io = writeComplete;
    submit_bio(WRITE_FLUSH_FUA, bio);
}
Code for reading from a specific sector (you also need to implement the readComplete function):
int readPage(struct block_device *device, sector_t sector, int size,
             struct page *page)
{
    int ret;
    struct completion event;
    struct bio *bio = bio_alloc(GFP_NOIO, 1);

    bio->bi_bdev = device;
    bio->bi_sector = sector;
    bio_add_page(bio, page, size, 0);

    init_completion(&event);
    bio->bi_private = &event;
    bio->bi_end_io = readComplete;
    submit_bio(READ | REQ_SYNC, bio);
    wait_for_completion(&event);

    ret = test_bit(BIO_UPTODATE, &bio->bi_flags);
    bio_put(bio);
    return ret;
}
The page can be allocated with alloc_page(GFP_KERNEL). To change the data in the page, use page_address(page); it returns a void*, so you can interpret the pointer however you want.
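
For reference, minimal completion callbacks matching the snippets above could look like this (again assuming the 3.x-era bio API, where bi_end_io receives the bio and an error code):

/* wakes up readPage(), which is blocked in wait_for_completion() */
static void readComplete(struct bio *bio, int err)
{
    complete((struct completion *)bio->bi_private);
}

/* writePage() is fire-and-forget, so just release the bio here */
static void writeComplete(struct bio *bio, int err)
{
    bio_put(bio);
}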

optimization of sequential i/o operations on large file sizes

Compiler: Microsoft C++ 2005. Hardware: AMD 64-bit (16 GB).
Sequential, read-only access to an 18 GB file is committed with the following timing, file-access, and file-structure characteristics: 18,184,359,164 bytes (file length), 11,240,476,672 bytes (NTFS compressed file length).
Time    File        Method                           Disk
14:33?  compressed  fstream                          fixed disk
14:06   normal      fstream                          fixed disk
12:22   normal      winapi                           fixed disk
11:47   compressed  winapi                           fixed disk
11:29   compressed  fstream                          ram disk
10:37   compressed  winapi                           ram disk
 7:18   compressed  7z stored decompression to ntfs  12gb ram disk
 6:37   normal      copy to same volume              fixed disk
The fstream constructor and access:
#define BUFFERSIZE 524288
unsigned int mbytes = BUFFERSIZE;
char *databuffer0 = (char *) malloc(mbytes);
datafile.open("drv:/file.ext", ios::in | ios::binary);
datafile.read(databuffer0, mbytes);
The winapi constructor and access:
#define BUFFERSIZE 524288
unsigned int mbytes = BUFFERSIZE;
const TCHAR* const filex = _T("drv:/file.ext");
char ReadBuffer[BUFFERSIZE] = {0};
hFile = CreateFile(filex, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
if( FALSE == ReadFile(hFile, ReadBuffer, BUFFERSIZE-1, &dwBytesRead, NULL))
{ ...
For the fstream method, buffer sizes up to 16 MB do not decrease processing time. For the winapi method, all buffer sizes beyond 0.5 MB fail. What methods would optimize this implementation with respect to processing time?
Did you try memory-mapping the file? In my tests this was always the fastest way to read large files.
Update: Here's an old but still accurate description of memory-mapped files:
http://msdn.microsoft.com/en-us/library/ms810613.aspx
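
A bare-bones sketch of that approach with the Win32 file-mapping API. Mapping the whole 18 GB file in one view assumes a 64-bit build (as in the question); a production version would check every handle and possibly map the file in smaller windows:

#include <windows.h>
#include <tchar.h>

HANDLE hFile = CreateFile(_T("drv:/file.ext"), GENERIC_READ, FILE_SHARE_READ,
                          NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);

// A maximum size of 0/0 means "map the entire file".
HANDLE hMap = CreateFileMapping(hFile, NULL, PAGE_READONLY, 0, 0, NULL);
const char *base = (const char *)MapViewOfFile(hMap, FILE_MAP_READ, 0, 0, 0);

LARGE_INTEGER size;
GetFileSizeEx(hFile, &size);

// Sequential scan; the OS faults pages in on demand and reads ahead.
unsigned long long checksum = 0;
for (LONGLONG i = 0; i < size.QuadPart; ++i)
    checksum += (unsigned char)base[i];

UnmapViewOfFile(base);
CloseHandle(hMap);
CloseHandle(hFile);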
Try this:
hf = CreateFile(..... FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED ...)
Then the reading loop. Minor details omitted, as I'm typing this on an iPad...
int bufsize = 4*1024*1024;
CEvent e1, e2, e3, e4;
unsigned char* pbuffer1 = (unsigned char*) malloc(bufsize);
unsigned char* pbuffer2 = (unsigned char*) malloc(bufsize);
unsigned char* pbuffer3 = (unsigned char*) malloc(bufsize);
unsigned char* pbuffer4 = (unsigned char*) malloc(bufsize);
LONGLONG CurOffset = 0; // 64-bit: the file is larger than 4 GB
do {
    OVERLAPPED r1;
    memset(&r1, 0, sizeof(OVERLAPPED));
    r1.Offset     = (DWORD)(CurOffset & 0xFFFFFFFF);
    r1.OffsetHigh = (DWORD)(CurOffset >> 32);
    CurOffset += bufsize;
    r1.hEvent = e1;
    // with FILE_FLAG_OVERLAPPED the byte-count argument is NULL;
    // completion is signalled through the event
    if (! ReadFile(hf, pbuffer1, bufsize, NULL, &r1)) {
        // check for errors other than ERROR_IO_PENDING, AND for ERROR_HANDLE_EOF (important)
    }

    OVERLAPPED r2;
    memset(&r2, 0, sizeof(OVERLAPPED));
    r2.Offset     = (DWORD)(CurOffset & 0xFFFFFFFF);
    r2.OffsetHigh = (DWORD)(CurOffset >> 32);
    CurOffset += bufsize;
    r2.hEvent = e2;
    if (! ReadFile(hf, pbuffer2, bufsize, NULL, &r2)) {
        // check for errors other than ERROR_IO_PENDING, AND for ERROR_HANDLE_EOF (important)
    }

    OVERLAPPED r3;
    memset(&r3, 0, sizeof(OVERLAPPED));
    r3.Offset     = (DWORD)(CurOffset & 0xFFFFFFFF);
    r3.OffsetHigh = (DWORD)(CurOffset >> 32);
    CurOffset += bufsize;
    r3.hEvent = e3;
    if (! ReadFile(hf, pbuffer3, bufsize, NULL, &r3)) {
        // check for errors other than ERROR_IO_PENDING, AND for ERROR_HANDLE_EOF (important)
    }

    OVERLAPPED r4;
    memset(&r4, 0, sizeof(OVERLAPPED));
    r4.Offset     = (DWORD)(CurOffset & 0xFFFFFFFF);
    r4.OffsetHigh = (DWORD)(CurOffset >> 32);
    CurOffset += bufsize;
    r4.hEvent = e4;
    if (! ReadFile(hf, pbuffer4, bufsize, NULL, &r4)) { // was pbuffer1: copy-paste typo fixed
        // check for errors other than ERROR_IO_PENDING, AND for ERROR_HANDLE_EOF (important)
    }

    // wait for the events to indicate data present
    // send the data to consuming threads
    // allocate new buffers
} while ( /* not eof, etc. */ );
The above is the bones of what you need. We use this and achieve high I/O throughput rates, but you may need to improve it slightly to reach ultimate performance. We found that four outstanding I/Os was best for our use, but this will vary by platform. Reading less than 1 MB per I/O hurt performance. Once a buffer has been read, don't try to consume it in the reading loop: post it to another thread and allocate another buffer (but get them from a reuse queue; don't keep calling malloc). The overall intent of the above is to keep four I/Os outstanding against the disk; as soon as you don't have that, overall performance drops.
Also, this works best on a disk that is only reading your file. If you start reading/writing different files on the same disk at the same time, performance drops quickly, unless you have SSDs!
Not sure why your ReadFile is failing for 0.5 MB buffers; I just double-checked, and our live prod code uses 4 MB buffers.
