Optimization of sequential I/O operations on large files - WinAPI

Compiler: Microsoft Visual C++ 2005. Hardware: AMD 64-bit, 16 GB RAM.
Sequential, read-only access to an 18 GB file is performed with the following timing, file-access, and file-structure characteristics:
18,184,359,164 bytes (file length)
11,240,476,672 bytes (NTFS compressed file length)
Time    File        Method                     Disk
14:33?  compressed  fstream                    fixed disk
14:06   normal      fstream                    fixed disk
12:22   normal      WinAPI                     fixed disk
11:47   compressed  WinAPI                     fixed disk
11:29   compressed  fstream                    RAM disk
10:37   compressed  WinAPI                     RAM disk
 7:18   compressed  7z "stored" decompression  to NTFS on 12 GB RAM disk
 6:37   normal      copy to same volume        fixed disk
The fstream constructor and access:
#define BUFFERSIZE 524288
unsigned int mbytes = BUFFERSIZE;
char* databuffer0 = (char*) malloc(mbytes);
ifstream datafile;
datafile.open("drv:/file.ext", ios::in | ios::binary);
datafile.read(databuffer0, mbytes);
The WinAPI constructor and access:
#define BUFFERSIZE 524288
unsigned int mbytes = BUFFERSIZE;
const TCHAR* const filex = _T("drv:/file.ext");
char ReadBuffer[BUFFERSIZE] = {0}; // stack array: sizes beyond the default 1 MB thread stack will fail
DWORD dwBytesRead = 0;
HANDLE hFile = CreateFile(filex, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
if (FALSE == ReadFile(hFile, ReadBuffer, BUFFERSIZE-1, &dwBytesRead, NULL))
{ ...
For the fstream method, buffer sizes up to 16 MB do not decrease processing time. All buffer sizes beyond 0.5 MB fail for the WinAPI method. What methods would optimize this implementation with respect to processing time?

Did you try memory-mapping the file? In my tests this was always the fastest way to read large files.
Update: Here's an old, but still accurate, description of memory-mapped files:
http://msdn.microsoft.com/en-us/library/ms810613.aspx
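A minimal sketch of the approach (not from the original answer; the file name and view size are placeholders, and a 64-bit process is assumed, since an 18 GB file cannot be mapped in one view in a 32-bit process):
// Sketch only: error handling reduced to early exits.
HANDLE hFile = CreateFile(_T("drv:/file.ext"), GENERIC_READ, FILE_SHARE_READ,
                          NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
if (hFile == INVALID_HANDLE_VALUE) return;

HANDLE hMap = CreateFileMapping(hFile, NULL, PAGE_READONLY, 0, 0, NULL); // map the whole file
if (hMap == NULL) { CloseHandle(hFile); return; }

LARGE_INTEGER fileSize;
GetFileSizeEx(hFile, &fileSize);

// Map a sliding window rather than all 18 GB at once; view offsets must be
// multiples of the allocation granularity (64 KB on x86/x64).
const SIZE_T viewSize = 64 * 1024 * 1024;
for (LONGLONG offset = 0; offset < fileSize.QuadPart; offset += viewSize) {
    LONGLONG remaining = fileSize.QuadPart - offset;
    SIZE_T bytes = (SIZE_T)(remaining < (LONGLONG)viewSize ? remaining : (LONGLONG)viewSize);
    const char* p = (const char*) MapViewOfFile(hMap, FILE_MAP_READ,
                        (DWORD)(offset >> 32), (DWORD)(offset & 0xFFFFFFFF), bytes);
    if (!p) break;
    // ... consume `bytes` bytes at p ...
    UnmapViewOfFile(p);
}
CloseHandle(hMap);
CloseHandle(hFile);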

Try this.
hf = CreateFile(..... FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED ...)
Then the reading loop. Minor details omitted as typing on iPad...
int bufsize = 4*1024*1024;
unsigned char* pbuffer[4];
OVERLAPPED r[4];
HANDLE e[4];
for (int i = 0; i < 4; i++) {
    // FILE_FLAG_NO_BUFFERING requires sector-aligned buffers; VirtualAlloc
    // returns page-aligned memory, which satisfies that (plain malloc does not).
    pbuffer[i] = (unsigned char*) VirtualAlloc(NULL, bufsize, MEM_COMMIT, PAGE_READWRITE);
    e[i] = CreateEvent(NULL, TRUE, FALSE, NULL); // manual-reset event for the OVERLAPPED
}
LONGLONG CurOffset = 0; // 64-bit: a plain int would overflow past 2 GB
do {
    // issue 4 outstanding reads, each with its own buffer, OVERLAPPED, and event
    for (int i = 0; i < 4; i++) {
        memset(&r[i], 0, sizeof(OVERLAPPED));
        r[i].Offset     = (DWORD)(CurOffset & 0xFFFFFFFF);
        r[i].OffsetHigh = (DWORD)(CurOffset >> 32);
        r[i].hEvent     = e[i];
        CurOffset += bufsize;
        if (!ReadFile(hf, pbuffer[i], bufsize, NULL, &r[i])) {
            // check for ERROR_IO_PENDING (expected) AND ERROR_HANDLE_EOF (important)
        }
    }
    // wait for events to indicate data present
    // send data to consuming threads
    // allocate new buffers
} while ( /* not EOF, etc. */ );
The above is the bones of what you need. We use this and achieve high I/O throughput rates, but you may need to refine it slightly to reach ultimate performance. We found four outstanding I/Os was best for our use, but this will vary by platform. Reading less than 1 MB per I/O was performance-negative. Once a buffer has been read, don't try to consume it in the reading loop: post it to another thread and allocate another buffer (but get buffers from a reuse queue, don't keep calling malloc; a sketch of such a queue is below). The overall intent of the above is to keep four outstanding I/Os open to the disk; as soon as you don't have this, overall performance will drop.
Also, this works best on a disk that is only reading your file. If you start reading or writing different files on the same disk at the same time, performance drops quickly, unless you have SSDs!
Not sure why your ReadFile is failing for buffers over 0.5 MB; I just double-checked, and our live prod code is using 4 MB buffers.
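The reuse queue mentioned above can be very simple. Here is a hedged sketch (the class name and the use of a critical section are illustrative assumptions, not the production code referred to above):
#include <windows.h>
#include <deque>

// A trivial pool: the reader pops a free buffer, consumers push it back when done.
class BufferPool {
    std::deque<unsigned char*> free_;
    CRITICAL_SECTION lock_;
public:
    BufferPool(size_t count, size_t bytes) {
        InitializeCriticalSection(&lock_);
        for (size_t i = 0; i < count; i++)
            free_.push_back((unsigned char*) VirtualAlloc(NULL, bytes, MEM_COMMIT, PAGE_READWRITE));
    }
    unsigned char* Get() {        // returns NULL if the pool is empty
        EnterCriticalSection(&lock_);
        unsigned char* p = NULL;
        if (!free_.empty()) { p = free_.front(); free_.pop_front(); }
        LeaveCriticalSection(&lock_);
        return p;
    }
    void Put(unsigned char* p) {  // a consumer thread returns a finished buffer
        EnterCriticalSection(&lock_);
        free_.push_back(p);
        LeaveCriticalSection(&lock_);
    }
};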

Related

MMAP buffer kernel writes are not seen by user space

I have a kernel driver which shares a buffer with the user-space layer.
Everything seemed to work fine in my VM prototype (Ubuntu, kernel 5.4), but when I moved my code to the target (same kernel, but an embedded distro) I can clearly see that kernel writes to the buffer (using memcpy or memset) are not reflected on the user-space side of the buffer.
Note that I use direct buffer accesses on both sides. There is no concurrency issue, as the kernel writes first and user space reads afterwards.
I ended up believing this is a cache issue, as the same code works perfectly in my VM.
The buffer size is 4 * PAGE_SIZE.
It is allocated as follows:
int _size = (SFP_BUFFER_SIZE + (PAGE_SIZE-1)) & ~(PAGE_SIZE-1);
input_buffer = (char*) kzalloc(_size, GFP_KERNEL); // size rounded up to a page boundary
if (!input_buffer) {
    dev_dbg(&dev, "open/ENOMEM (input_buffer)\n");
    status = -ENOMEM;
    goto err_all;
}
When mmap'ing, I used the following code pattern:
vma->vm_ops = &fpgadrv_vm_ops;
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
pfn = virt_to_phys((void*)input_buffer) >> PAGE_SHIFT;
if (remap_pfn_range(vma, vma->vm_start, pfn, size, vma->vm_page_prot))
{
    printk(KERN_DEBUG "remap page range failed\n");
    return -EAGAIN;
}
User-space code and kernel code use memcpy to update the buffer. Note also that I cannot use the write/read entry points, as they are already used for very specific operations.
The user code is calling mmap as follows:
buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, device_fd, 0);
if (buf == MAP_FAILED)
{
    perror("USERDRV: cannot mmap");
    return -1; // for testing, ignore the return code and continue
}
and upon an IOCTL call, the kernel fills the mmap'ed buffer as follows:
case IOCTL_RESET:
    printk(KERN_DEBUG "FPGADRV: IOCTL RESET");
    // reset the buffer (zero + put back the signature)
    memset(input_buffer, 0xA5, SFP_BUFFER_SIZE);
    memcpy((void*)input_buffer, (void*)signature, 10);
    break;
Is there something more I should do to make sure the pages are not cached (assuming this is the cause of my problem)?
Thanks,
Jacques
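For context, a minimal user-space sequence exercising the flow described above (the device path is an assumption; BUF_SIZE and IOCTL_RESET are taken to come from the driver's headers):
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

int check_buffer(void)
{
    int device_fd = open("/dev/fpgadrv", O_RDWR); /* device path is an assumption */
    char *buf = (char *) mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, device_fd, 0);
    if (buf == MAP_FAILED)
        return -1;
    ioctl(device_fd, IOCTL_RESET); /* kernel memsets the buffer and rewrites the signature */
    /* If the mapping is coherent, the signature and the 0xA5 fill are visible here. */
    printf("sig[0]=%02x fill[10]=%02x\n", (unsigned char)buf[0], (unsigned char)buf[10]);
    return 0;
}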

MacOS shm - Unable to get true data size in shm

When doing shm-related development on macOS, the approach I arrived at is shown in the code below (it has been verified to work).
However, there is a new problem I cannot solve: when ftruncate() adjusts the memory size for shm_fd, the allocation is rounded up to a multiple of the page size.
Because of this, when the shared-memory file is opened by another process, the actual data size cannot be obtained: the reported file size is an integer multiple of the page size, which causes an error when appending data.
// write: data_size = 12
char *data = "....";
long data_size = 12;
shmFD = shm_open(...);
ftruncate(shmFD, data_size); // the size actually allocated is not 12 but 4096
shmAddr = (char *)mmap(NULL, data_size, ..., shmFD, 0);
memcpy(shmAddr, data, data_size);

// read
...
fstat(shmFD, &sb);
long context_len_in_shm = sb.st_size;
// wrong shm size obtained -> context_len_in_shm == 4096
For now I use the following structure to record data in shm: before every read or write, the first step is to fetch the data_len field, which gives the length of the payload that follows it. I hope there is a more concise way, like using lseek() on Linux.
shm mem map:
----shm mem----
struct {
    long data_len;
    data[1];
    data[2];
    ...
    data[data_len];
}
---------------
long *shm_mem = (long *)shmAddr;
long data_size = shm_mem[0]; // before reading, check that the shm file is non-empty and the pointer is valid (omitted here)
char *shm_data = (char *)&shm_mem[1];
char *buffer = (char *)malloc(data_size);
memcpy(buffer, shm_data, data_size);
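For completeness, the corresponding write side under the same length-prefix layout might look like this (a sketch; shmFD, data, and data_size follow the snippets above):
// Write side: reserve room for the header plus payload, then store the
// length first so any reader can recover the true data size.
long total = sizeof(long) + data_size;
ftruncate(shmFD, total);                      // still rounded up to a page internally
long *shm_mem = (long *) mmap(NULL, total, PROT_READ | PROT_WRITE, MAP_SHARED, shmFD, 0);
shm_mem[0] = data_size;                       // header: true payload length
memcpy((char *)&shm_mem[1], data, data_size); // payload follows the header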

Will Buffering Data Improve the Write Performance in IStream?

I am trying to create an IStream from a disk file and then write many blocks of data into it. Each block is 4096 bytes. Since there are many blocks to write, I was thinking of buffering several of them into one large buffer and performing the actual write to the IStream only when the large buffer is full. Will that improve write performance?
I made a small app to test that:
CFile SrcFile;
LPSTREAM lpStreamFile = NULL;
BYTE lpBuf[4096];
BYTE lpLBuf[65536];
BYTE* lpCurBuf;
UINT uRead;
DWORD uStart, uStop;

uStart = ::GetTickCount();
if (SrcFile.Open(_T("D:\\1GB.dat"), CFile::modeRead | CFile::shareExclusive | CFile::typeBinary))
{
    SrcFile.SeekToBegin();
    ::DeleteFile(_T("F:\\1GB.dat"));
    if (SUCCEEDED(SHCreateStreamOnFileEx(_T("F:\\1GB.dat"), STGM_READWRITE | STGM_SHARE_EXCLUSIVE | STGM_CREATE | STGM_DIRECT,
        FILE_ATTRIBUTE_NORMAL, TRUE, NULL, &lpStreamFile)) && (lpStreamFile != NULL))
    {
        lpCurBuf = lpLBuf;
        while (TRUE)
        {
            uRead = SrcFile.Read(lpBuf, 4096);
            if ((lpCurBuf + uRead) > (lpLBuf + 65536))
            {
                // large buffer full: flush it to the stream
                lpStreamFile->Write(lpLBuf, (ULONG)(lpCurBuf - lpLBuf), NULL);
                lpCurBuf = lpLBuf;
            }
            ::memcpy(lpCurBuf, lpBuf, uRead);
            lpCurBuf += uRead;
            // lpStreamFile->Write(lpBuf, 4096, NULL);   // unbuffered variant
            if (uRead < 4096)
                break;
        }
        if (lpCurBuf > lpLBuf)
        {
            // flush the final, partially filled buffer
            lpStreamFile->Write(lpLBuf, (ULONG)(lpCurBuf - lpLBuf), NULL);
        }
        lpStreamFile->Commit(STGC_DEFAULT);
        if (lpStreamFile != NULL)
        {
            // Release the stream
            lpStreamFile->Release();
            lpStreamFile = NULL;
        }
    }
    SrcFile.Close();
}
uStop = GetTickCount();

CString strMsg;
strMsg.Format(_T("Total tick count = %u."), uStop - uStart);
AfxMessageBox(strMsg);
The large buffer size is 65536, 16 times 4096. I used a 1 GB data file for the test.
To my surprise, when the large buffer is not used, the total time is always about 10 seconds.
However, when the large buffer is used, the first run takes 30 seconds, and the remaining runs take 10 seconds.
So it seems that IStream already has an internal write buffer, and my large buffer brings no benefit. However, I cannot find any documentation saying that, nor any documentation telling me how to control it, for example, to set the size of that buffer.

Upper limit to UDP performance on Windows Server 2008

From my testing, it looks like I am hitting a performance wall on my 10 Gb network. I seem to be unable to read more than 180-200k packets per second, even though perfmon and Task Manager show up to a million packets per second arriving, if not more. Testing 1 socket, or 10, or 100, doesn't seem to change this limit of 200-300k packets per second. I've fiddled with RSS and the like without success. Unicast vs. multicast doesn't seem to matter, and overlapped I/O vs. synchronous makes no difference either. Packet size doesn't matter either. There just seems to be a hard limit to the number of packets Windows can copy from the NIC to the buffer. This is a Dell R410. Any ideas?
#include "stdafx.h"
#include <WinSock2.h>
#include <ws2ipdef.h>
static inline void fillAddr(const char* const address, unsigned short port, sockaddr_in &addr)
{
memset( &addr, 0, sizeof( addr ) );
addr.sin_family = AF_INET;
addr.sin_addr.s_addr = inet_addr( address );
addr.sin_port = htons(port);
}
int _tmain(int argc, _TCHAR* argv[])
{
#ifdef _WIN32
WORD wVersionRequested;
WSADATA wsaData;
int err;
wVersionRequested = MAKEWORD( 1, 1 );
err = WSAStartup( wVersionRequested, &wsaData );
#endif
int error = 0;
const char* sInterfaceIP = "10.20.16.90";
int nInterfacePort = 0;
//Create socket
SOCKET m_socketID = socket( AF_INET, SOCK_DGRAM, IPPROTO_UDP );
//Re use address
struct sockaddr_in addr;
fillAddr( "10.20.16.90", 12400, addr ); //"233.43.202.1"
char one = 1;
//error = setsockopt(m_socketID, SOL_SOCKET, SO_REUSEADDR , &one, sizeof(one));
if( error != 0 )
{
fprintf( stderr, "%s: ERROR setsockopt returned %d.\n", __FUNCTION__, WSAGetLastError() );
}
//Bind
error = bind( m_socketID, reinterpret_cast<SOCKADDR*>( &addr ), sizeof( addr ) );
if( error == -1 )
{
fprintf(stderr, "%s: ERROR %d binding to %s:%d\n",
__FUNCTION__, WSAGetLastError(), sInterfaceIP, nInterfacePort);
}
//Join multicast group
struct ip_mreq mreq;
mreq.imr_multiaddr.s_addr = inet_addr("225.2.3.13");//( "233.43.202.1" );
mreq.imr_interface.s_addr = inet_addr("10.20.16.90");
//error = setsockopt( m_socketID, IPPROTO_IP, IP_ADD_MEMBERSHIP, reinterpret_cast<char*>( &mreq ), sizeof( mreq ) );
if (error == -1)
{
fprintf(stderr, "%s: ERROR %d trying to join group %s.\n", __FUNCTION__, WSAGetLastError(), "233.43.202.1" );
}
int bufSize = 0, len = sizeof(bufSize), nBufferSize = 10*1024*1024;//8192*1024;
//Resize the buffer
getsockopt(m_socketID, SOL_SOCKET, SO_RCVBUF, (char*)&bufSize, &len );
fprintf(stderr, "getsockopt size before %d\n", bufSize );
fprintf(stderr, "setting buffer size %d\n", nBufferSize );
error = setsockopt(m_socketID, SOL_SOCKET, SO_RCVBUF,
reinterpret_cast<const char*>( &nBufferSize ), sizeof( nBufferSize ) );
if( error != 0 )
{
fprintf(stderr, "%s: ERROR %d setting the receive buffer size to %d.\n",
__FUNCTION__, WSAGetLastError(), nBufferSize );
}
bufSize = 1234, len = sizeof(bufSize);
getsockopt(m_socketID, SOL_SOCKET, SO_RCVBUF, (char*)&bufSize, &len );
fprintf(stderr, "getsockopt size after %d\n", bufSize );
//Non-blocking
u_long op = 1;
ioctlsocket( m_socketID, FIONBIO, &op );
//Create IOCP
HANDLE iocp = CreateIoCompletionPort( INVALID_HANDLE_VALUE, NULL, NULL, 1 );
HANDLE iocp2 = CreateIoCompletionPort( (HANDLE)m_socketID, iocp, 5, 1 );
char buffer[2*1024]={0};
int r = 0;
OVERLAPPED overlapped;
memset(&overlapped, 0, sizeof(overlapped));
DWORD bytes = 0, flags = 0;
// WSABUF buffers[1];
//
// buffers[0].buf = buffer;
// buffers[0].len = sizeof(buffer);
//
// while( (r = WSARecv( m_socketID, buffers, 1, &bytes, &flags, &overlapped, NULL )) != -121 )
//sleep(100000);
while( (r = ReadFile( (HANDLE)m_socketID, buffer, sizeof(buffer), NULL, &overlapped )) != -121 )
{
bytes = 0;
ULONG_PTR key = 0;
LPOVERLAPPED pOverlapped;
if( GetQueuedCompletionStatus( iocp, &bytes, &key, &pOverlapped, INFINITE ) )
{
static unsigned __int64 total = 0, printed = 0;
total += bytes;
if( total - printed > (1024*1024) )
{
printf( "%I64dmb\r", printed/ (1024*1024) );
printed = total;
}
}
}
while( r = recv(m_socketID,buffer,sizeof(buffer),0) )
{
static unsigned int total = 0, printed = 0;
if( r > 0 )
{
total += r;
if( total - printed > (1024*1024) )
{
printf( "%dmb\r", printed/ (1024*1024) );
printed = total;
}
}
}
return 0;
}
I am using Iperf as the sender and comparing the amount of data received to the amount of data sent: iperf.exe -c 10.20.16.90 -u -P 10 -B 10.20.16.51 -b 1000000000 -p 12400 -l 1000
Edit: Doing iperf-to-iperf, the performance is closer to 180k or so without drops (8 MB client-side buffer). With TCP I can do about 200k packets/second. Here's what's interesting, though: I can do far more than 200k with multiple TCP connections, but multiple UDP connections do not increase the total (I test UDP performance with multiple iperfs, since a single iperf with multiple threads doesn't seem to work). All hardware acceleration is turned on in the drivers. It seems like UDP performance is simply subpar?
I've been doing some UDP testing with similar hardware as I investigate the performance gains that can be had from using the Winsock Registered I/O network extensions, RIO, in Windows 8 Server. For this I've been running tests on Windows Server 2008 R2 and on Windows Server 8.
I've yet to get to the point where I've begun testing with our 10Gb cards (they've only just arrived) but the results of my earlier tests and the example programs used to run them can be found here on my blog.
One thing that I might suggest is that with a simple test like the one you show, where there's very little work being done on each datagram, you may find that old-fashioned synchronous I/O is faster than the IOCP design. The IOCP design pulls ahead as the workload per datagram rises and you can fully utilise multiple threads.
Also, are your test machines wired back to back (i.e. without a switch), or do they run through a switch? If the latter, could the issue come down to the performance of your switch rather than your test machines? If you're using a switch, or have multiple NICs in the server, can you run multiple clients against the server? Could the issue be on the client rather than the server?
What CPU usage are you seeing on the sending and receiving machines? Have you looked at the machines' CPU usage with Process Explorer? This is more accurate than Task Manager. Which CPU is handling the NIC interrupts? Can you improve things by binding them to another CPU, or by changing the affinity of your test program to run on another CPU? Is your IOCP example spreading its threads across multiple NUMA nodes, or are you locking all of them to one node?
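For the affinity experiment, a minimal sketch (the mask pinning to CPU 1 is an arbitrary example, not a recommendation):
#include <windows.h>

// Pin the current process (or a single thread) to one CPU so the receive
// path and the NIC interrupt handler can be placed on different cores.
SetProcessAffinityMask(GetCurrentProcess(), 0x2); // CPU 1 only (example mask)
SetThreadAffinityMask(GetCurrentThread(), 0x2);   // or pin just this thread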
I'm hoping to get to run some more tests next week and will update my answer when I have done so.
Edit: For me the problem was that the NIC drivers had "flow control" enabled, which caused the sender to run at the speed of the receiver. This had some undesirable "non-paged pool" usage characteristics, and turning off flow control lets you see how fast the sender can go (the difference in network utilisation between the sender and receiver clearly shows how much data is being lost). See my blog posting here for more details.

Reserve disk space before writing a file for efficiency

I have noticed a huge performance hit in one of my projects when logging is enabled for the first time. But when the log file size limit is reached and the program starts writing from the beginning of the file again, logging is much faster (about 50% faster). It's normal to set the log file size to hundreds of MBs.
Most download managers allocate a dummy file of the required size before starting to download. This makes writing more efficient, because the whole chunk is allocated at once.
What is the best way to reserve disk space efficiently, by some fixed size, when my program starts for the first time?
void ReserveSpace(LONG spaceLow, LONG spaceHigh, HANDLE hFile)
{
    DWORD err = ::SetFilePointer(hFile, spaceLow, &spaceHigh, FILE_BEGIN);
    if (err == INVALID_SET_FILE_POINTER) {
        // INVALID_SET_FILE_POINTER can also be a valid low DWORD for large
        // files, so confirm with GetLastError()
        err = GetLastError();
        // handle error
    }
    if (!::SetEndOfFile(hFile)) {
        err = GetLastError();
        // handle error
    }
    err = ::SetFilePointer(hFile, 0, 0, FILE_BEGIN); // reset to the start
}
wRAR is correct.
Open a new file using your favourite library, then seek to the last byte of the desired size and write a 0 there. That should allocate all the required disk space.
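A minimal sketch of that approach with a plain std::ofstream (path and size are placeholders; whether the space is physically allocated or the file ends up sparse depends on the file system):
#include <fstream>

// Create the file and force it out to `reserveBytes` by writing the last byte.
void ReserveWithSeek(const char* path, std::streamoff reserveBytes)
{
    std::ofstream f(path, std::ios::binary);
    f.seekp(reserveBytes - 1); // position at the final byte of the desired size
    f.put('\0');               // writing it extends the file to reserveBytes
}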
If you are using C++17, you can do it with std::filesystem::resize_file:
Link
Changes the size of the regular file named by p as if by POSIX truncate: if the file size was previously larger than new_size, the remainder of the file is discarded. If the file was previously smaller than new_size, the file size is increased and the new area appears as if zero-filled.
#include <iostream>
#include <iomanip>
#include <fstream>
#include <filesystem>
namespace fs = std::filesystem;
int main()
{
fs::path p = fs::current_path() / "example.bin";
std::ofstream(p).put('a');
fs::resize_file(p, 1024*1024*1024); // resize to 1 G
}
You can use the SetFileValidData function to extend the logical length of a file without having to write out all that data to disk. However, because it can allow reading disk data to which you might not otherwise be privileged, it requires the SE_MANAGE_VOLUME_NAME privilege. Carefully read the Remarks section of the documentation.
Also, the implementation of SetFileValidData depends on the file-system driver: NTFS supports it, and FAT only since Windows 7.
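A hedged sketch of the full sequence (enable the privilege, extend the file, then mark the data valid); the helper name and file path are placeholders, and error handling is reduced to early exits:
#include <windows.h>

// Enable SE_MANAGE_VOLUME_NAME for the current process, then reserve `size`
// bytes without zero-filling them. Requires a sufficiently privileged token.
bool ReserveWithValidData(const wchar_t* path, LONGLONG size)
{
    HANDLE hToken;
    if (!OpenProcessToken(GetCurrentProcess(), TOKEN_ADJUST_PRIVILEGES, &hToken))
        return false;
    TOKEN_PRIVILEGES tp = {1}; // one privilege entry
    LookupPrivilegeValue(NULL, SE_MANAGE_VOLUME_NAME, &tp.Privileges[0].Luid);
    tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
    AdjustTokenPrivileges(hToken, FALSE, &tp, 0, NULL, NULL);
    CloseHandle(hToken);

    HANDLE hFile = CreateFileW(path, GENERIC_WRITE, 0, NULL,
                               CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hFile == INVALID_HANDLE_VALUE)
        return false;

    LARGE_INTEGER li; li.QuadPart = size;
    bool ok = SetFilePointerEx(hFile, li, NULL, FILE_BEGIN)
           && SetEndOfFile(hFile)            // extend the file
           && SetFileValidData(hFile, size); // skip zero-fill of the new range
    CloseHandle(hFile);
    return ok;
}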
If you're using C#, you can call SetLength on a FileStream to instantly set an initial size for the file.
e.g.
using (var fileStream = File.Open(@"file.txt", FileMode.Create, FileAccess.Write, FileShare.Read))
{
    fileStream.SetLength(1024 * 1024 * 1024); // Reserve 1 GB
}
Here's a simple function that will work for files of any size:
void SetFileSize(HANDLE hFile, LARGE_INTEGER size)
{
    SetFilePointerEx(hFile, size, NULL, FILE_BEGIN); // 64-bit-safe seek
    SetEndOfFile(hFile);
}
