Will Buffering Data Improve the Write Performance in IStream? - windows

I am trying to create an IStream from a disk file, then write many blocks of data into it. Each block is 4096 bytes in size. Since there are many blocks to write, I am thinking of buffering several blocks into one large buffer and, when the large buffer is full, performing the actual write to the IStream. Will that improve write performance?
I made a small app to test this:
CFile SrcFile;
LPSTREAM lpStreamFile = NULL;
BYTE lpBuf[4096];
BYTE lpLBuf[65536];
BYTE* lpCurBuf;
UINT uRead;
DWORD uStart, uStop;

uStart = ::GetTickCount();
if (SrcFile.Open(_T("D:\\1GB.dat"), CFile::modeRead | CFile::shareExclusive | CFile::typeBinary))
{
    SrcFile.SeekToBegin();
    ::DeleteFile(_T("F:\\1GB.dat"));
    if (SUCCEEDED(SHCreateStreamOnFileEx(_T("F:\\1GB.dat"),
            STGM_READWRITE | STGM_SHARE_EXCLUSIVE | STGM_CREATE | STGM_DIRECT,
            FILE_ATTRIBUTE_NORMAL, TRUE, NULL, &lpStreamFile)) && (lpStreamFile != NULL))
    {
        lpCurBuf = lpLBuf;
        while (TRUE)
        {
            uRead = SrcFile.Read(lpBuf, 4096);
            if ((lpCurBuf + uRead) > (lpLBuf + 65536))
            {
                // Large buffer full: flush it to the stream.
                lpStreamFile->Write(lpLBuf, lpCurBuf - lpLBuf, NULL);
                lpCurBuf = lpLBuf;
            }
            ::memcpy(lpCurBuf, lpBuf, uRead);
            lpCurBuf += uRead;
            // lpStreamFile->Write(lpBuf, 4096, NULL); // unbuffered variant
            if (uRead < 4096)
                break;
        }
        if (lpCurBuf > lpLBuf)
        {
            // Flush the remaining tail.
            lpStreamFile->Write(lpLBuf, lpCurBuf - lpLBuf, NULL);
        }
        lpStreamFile->Commit(STGC_DEFAULT);
        if (lpStreamFile != NULL)
        {
            // Release the stream
            lpStreamFile->Release();
            lpStreamFile = NULL;
        }
    }
    SrcFile.Close();
}
uStop = ::GetTickCount();

CString strMsg;
strMsg.Format(_T("Total tick count = %u."), uStop - uStart);
AfxMessageBox(strMsg);
The large buffer size is 65536 bytes, 16 times 4096. I use a 1GB data file for the test.
To my surprise, when the large buffer is not used, the total time is consistently about 10 seconds.
However, when the large buffer is used, the first run takes 30 seconds, and subsequent runs take 10 seconds.
So it seems that IStream already has an internal write buffer, so my large buffer provides no benefit. However, I cannot find any documentation saying that, nor any documentation on how to control it, for example, how to set the size of the buffer.

Related

Which operations update last access time?

Assuming a given filesystem tracks Last Access Time (aka atime) -- which operations on a file cause atime to update?
As far as I know:
opening an existing file (and subsequently closing the related handle/fd) does not update atime
reading/writing a file will update atime (I wonder whether a zero-byte read does)
reading a file's security descriptor (via the related Win32 API) does not update atime or other file attributes
Is there an exhaustive list of operations that update atime?
The last access time includes the last time the file or directory was written to, read from, or, in the case of executable files, run.
Other operations, like accessing the file to retrieve properties to show in Explorer or some other viewer, or accessing the file to retrieve its icon, don't update the last access time.
Refer to "GetFileTime - lpLastAccessTime" and "How do I access a file without updating its last-access time?"
Update: added test results for reading/writing 0 bytes and reading/writing 1 byte.
Code used for testing:
void GetLastAccessTime(HANDLE hFile)
{
    FILETIME ftAccess;
    SYSTEMTIME stUTC, stLocal;
    printf("Get last access time\n");
    // Retrieve the file times for the file.
    if (!GetFileTime(hFile, NULL, &ftAccess, NULL))
        return;
    // Convert the last-access time to local time.
    FileTimeToSystemTime(&ftAccess, &stUTC);
    SystemTimeToTzSpecificLocalTime(NULL, &stUTC, &stLocal);
    // Build a string showing the date and time.
    wprintf(L"%02d/%02d/%d %02d:%02d \n",
            stLocal.wMonth, stLocal.wDay, stLocal.wYear,
            stLocal.wHour, stLocal.wMinute);
}

int main()
{
    HANDLE tFile = INVALID_HANDLE_VALUE;
    printf("Open file\n");
    // Open file
    tFile = CreateFile(L"C:\\Users\\ritah\\Desktop\\test1.txt", GENERIC_READ | GENERIC_WRITE,
                       0, NULL, OPEN_EXISTING, 0, NULL);
    if (INVALID_HANDLE_VALUE == tFile)
    {
        printf("CreateFile fails with error: %lu\n", GetLastError());
        getchar();
        return 0;
    }
    printf("Sleep 60 seconds\n");
    Sleep(60000);
    GetLastAccessTime(tFile);

    // Read 0 bytes
    printf("Read 0 bytes\n");
    WCHAR redBuf[10];
    DWORD redBytes = 0;
    if (!ReadFile(tFile, redBuf, 0, &redBytes, NULL))
    {
        printf("ReadFile fails with error: %lu\n", GetLastError());
        getchar();
        return 0;
    }
    printf("Sleep 60 seconds\n");
    Sleep(60000);
    GetLastAccessTime(tFile);

    // Write 0 bytes
    printf("Write 0 bytes\n");
    WCHAR writeBuf[] = L"write test";
    DWORD writeBytes = 0;
    if (!WriteFile(tFile, writeBuf, 0, &writeBytes, NULL))
    {
        printf("WriteFile fails with error: %lu\n", GetLastError());
        getchar();
        return 0;
    }
    printf("Sleep 60 seconds\n");
    Sleep(60000);
    GetLastAccessTime(tFile);
    getchar();
}
So, read/write 0 bytes doesn't update last access time.

MacOS shm - Unable to get true data size in shm

When doing shm-related development on macOS, I use the code shown below (which has been verified to work).
However, there is a new problem I cannot solve: when ftruncate adjusts the memory size for shm_fd, the size is allocated in multiples of the page size.
But then, when the shared memory file is opened by another process, the actual data size cannot be obtained correctly: the file size reported is an integer multiple of the page size, which causes an error when appending data.
// write: data_size = 12
char *data = "....";
long data_size = 12;
shmFD = shm_open(...);
ftruncate(shmFD, data_size); // the size actually allocated is not 12 but 4096
shmAddr = (char *)mmap(NULL, data_size, ..., shmFD, 0);
memcpy(shmAddr, data, data_size);

// read
...
fstat(shmFD, &sb);
long context_len_in_shm = sb.st_size;
// wrong shm size obtained -> context_len_in_shm = 4096
For now I use the following structure to record data in shm: the first operation before writing or reading is to get the value of the data_len field, and then determine the length of the data that follows. I hope there is a more concise way, like using lseek() under Linux.
shm mem map :
----shm mem----
struct {
long data_len;
data[1];
data[2];
...
data[data_len];
}
---------------
long *shm_mem = (long *)shmAddr;
long data_size = shm_mem[0]; // Before reading, you need to determine whether the shm file is empty and whether the pointer is valid. It is omitted here.
char *shm_data = (char *)&(shm_mem[1]);
char *buffer = (char *)malloc(data_size);
memcpy(buffer, shm_data, data_size);

Non-blocking reads/writes to stdin/stdout in C on Linux or Mac

I have two programs communicating via named pipes (on a Mac), but the buffer size of named pipes is too small. Program 1 writes 50K bytes to pipe 1 before reading pipe 2. Named pipes are 8K (on my system), so program 1 blocks until the data is consumed. Program 2 reads 20K bytes from pipe 1 and then writes 20K bytes to pipe 2. Pipe 2 can't hold 20K, so program 2 now blocks; it will only be released when program 1 does its reads. But program 1 is blocked waiting for program 2: deadlock.
I thought I could fix the problem by creating a gasket program that reads stdin non-blocking and writes stdout non-blocking, temporarily storing the data in a large buffer. I tested the program using cat data | ./gasket 0 | ./gasket 1 > out, expecting out to be a copy of data. However, while the first invocation of gasket works as expected, the read in the second invocation returns 0 before all the data is consumed and never returns anything other than 0 in subsequent calls.
I tried the code below on both a Mac and Linux; both behave the same. I've added logging showing that the fread in the second invocation of gasket starts getting no data even though it has not read all the data written by the first invocation.
#include <stdio.h>
#include <fcntl.h>
#include <time.h>
#include <stdlib.h>
#include <unistd.h>

#define BUFFER_SIZE 100000
char buffer[BUFFER_SIZE];
int elements = 0;

int main(int argc, char **argv)
{
    int total_read = 0, total_write = 0;
    FILE *logfile = fopen(argv[1], "w");
    int flags = fcntl(fileno(stdin), F_GETFL, 0);
    fcntl(fileno(stdin), F_SETFL, flags | O_NONBLOCK);
    flags = fcntl(fileno(stdout), F_GETFL, 0);
    fcntl(fileno(stdout), F_SETFL, flags | O_NONBLOCK);
    while (1) {
        int num_read = 0;
        if (elements < (BUFFER_SIZE - 1024)) { // space in buffer
            num_read = fread(&buffer[elements], sizeof(char), 1024, stdin);
            elements += num_read;
            total_read += num_read;
            fprintf(logfile, "read %d (%d) elements \n", num_read, total_read); fflush(logfile);
        }
        if (elements > 0) { // something in buffer that we can write
            int num_written = fwrite(&buffer[0], sizeof(char), elements, stdout); fflush(stdout);
            total_write += num_written;
            fprintf(logfile, "wrote %d (%d) elements \n", num_written, total_write); fflush(logfile);
            if (num_written > 0) { // copy remaining data to top of buffer
                for (int i = 0; i < (elements - num_written); i++) {
                    buffer[i] = buffer[i + num_written];
                }
                elements -= num_written;
            }
        }
    }
}
I guess I could make the gasket multi-threaded and use blocking reads in one thread and blocking writes in the other, but I would like to understand why non-blocking IO seems to break for me.
Thanks!
My general solution to any IPC project is to make the client and server non-blocking I/O. To do so requires queuing data both on writing and reading, to handle cases where the OS can't read/write, or can only read/write a portion of your message.
The code below will probably seem like EXTREME overkill, but if you get it working, you can use it the rest of your career, whether for named pipes, sockets, network, you name it.
In pseudo-code:
typedef struct {
    const char* pcData, * pcToFree;  // pcData may no longer point to malloc'd region
    int iToSend;
} DataToSend_T;

queue of DataToSend_T qdts;

// Caller will use malloc() to allocate storage, and create the message in
// that buffer. MyWrite() will free it now, or WritableCB() will free it
// later. Either way, the app must NOT free it, and must not even refer to
// it again.
MyWrite( const char* pcData, int iToSend ) {
    iSent = 0;
    // Normally the OS will tell select() if the socket is writable, but if we're hugely
    // compute-bound, then it won't have a chance to. So let's call WritableCB() to
    // send anything in our queue that is now sendable. We have to send the data in
    // order, of course, so we can't send the new data until the entire queue is done.
    WritableCB();
    if ( qdts has no entries ) {
        iSent = write( pcData, iToSend );
        // TODO: check error
        // Did we send it all? We're done.
        if ( iSent == iToSend ) {
            free( pcData );
            return;
        }
    }
    // OK, either 1) we had stuff queued already meaning we can't send, or 2)
    // we tried to send but couldn't send it all.
    add to queue qdts the DataToSend ( pcData + iSent, pcData, iToSend - iSent );
}

WritableCB() {
    while ( qdts has entries ) {
        DataToSend_T* pdts = qdts head;
        int iSent = write( pdts->pcData, pdts->iToSend );
        // TODO: check error
        if ( iSent == pdts->iToSend ) {
            free( pdts->pcToFree );
            pop the front node off qdts
        } else {
            pdts->pcData += iSent;
            pdts->iToSend -= iSent;
            return;
        }
    }
}

// Off-subject but I like a TINY buffer as an initial value, so that the
// "buffer growth" code is exercised in almost all usage and we're sure it
// works. If the initial buffer size is something like 1M and almost never
// grows, the grow code may be buggy and we won't know until there's a crash
// years later.
int iBufSize = 1, iEnd = 0;  // iEnd is the first byte NOT in a message
char* pcBuf = malloc( iBufSize );

ReadableCB() {
    // Keep reading the socket until there's no more data. Grow buffer if necessary.
    while (1) {
        int iSpace = iBufSize - iEnd;
        int iRead = read( pcBuf + iEnd, iSpace );
        // TODO: check error
        iEnd += iRead;
        // If we read less than we had space for, then read returned because this is
        // all the available data, not because the buffer was too small.
        if ( iRead < iSpace )
            break;
        // Otherwise, double the buffer and try reading some more.
        iBufSize *= 2;
        pcBuf = realloc( pcBuf, iBufSize );
    }
    iStart = 0;
    while (1) {
        if ( pcBuf[ iStart ] until iEnd-1 is less than a message ) {
            // If our partial message isn't at the front of the buffer, move it there.
            if ( iStart ) {
                memmove( pcBuf, pcBuf + iStart, iEnd - iStart );
                iEnd -= iStart;
            }
            return;
        }
        // process a message, and advance iStart by the size of that message.
    }
}

main() {
    // Do your initial processing, and call MyWrite() to send and/or queue data.
    while (1) {
        select();  // see man page
        if ( the file handle is readable )
            ReadableCB();
        if ( the file handle is writable )
            WritableCB();
        if ( the file handle is in error )
            // handle it
        if ( application is finished )
            exit( EXIT_SUCCESS );
    }
}

Azure Page Blob OpenRead does not fetch more than StreamMinimumReadSizeInBytes

I have a page blob containing effectively log data. Everything works fine until the log fills up past 2 MB.
When Reading, I'm using the OpenReadAsync method to get a stream from which I read data out of. Prior to calling OpenReadAsync, I set StreamMinimumReadSizeInBytes to 2MB (2 * 1024 * 1024).
After opening the stream, I use the following method to read data out.
public IEnumerable<object> Read(Stream pageAlignedEventStream, long? maxBytes = null)
{
    while (pageAlignedEventStream.Position < (maxBytes ?? pageAlignedEventStream.Length))
    {
        byte[] bytesToReadBuffer = new byte[LongZero.Length];
        pageAlignedEventStream.Read(bytesToReadBuffer, 0, LongZero.Length);
        long bytesToRead = BitConverter.ToInt64(bytesToReadBuffer, 0);
        if (bytesToRead == 0)
        {
            yield break;
        }
        if (bytesToRead < 0)
        {
            throw new InvalidOperationException("Invalid size specification. Stream may be corrupted.");
        }
        if (bytesToRead > Int32.MaxValue)
        {
            throw new InvalidOperationException("Payload size is too large.");
        }
        byte[] payload = new byte[bytesToRead];
        int read = pageAlignedEventStream.Read(payload, 0, (int) bytesToRead);
        if (read != bytesToRead)
        {
            // when this fails: read == 503, bytesToRead == 3575, position == 2MB (2*1024*1024)
            throw new InvalidOperationException("Did not read expected number of bytes.");
        }
        yield return this.EventSerializer.DeserializeFromStream(new MemoryStream(payload, false));
        var paddedSpaceToSkip = PagesRequired(bytesToRead) * PageSizeBytes - bytesToRead - LongZero.Length;
        pageAlignedEventStream.Position += paddedSpaceToSkip;
    }
    yield break;
}
As noted in the comments in the code, the failure happens when the position reaches the 2MB specified. The read fails to pull additional bytes before returning and only reads 503 bytes instead of the expected 3575 bytes.
My expectation was that as I read past the buffer size, it would download more data.
I found a similar issue on Azure Feedback, but the linked issue involves a non-power-of-2 buffer size, and 2MB is definitely a power of 2.
I could fetch all the data (size = 3MB) stored in a page blob even though I set the StreamMinimumReadSizeInBytes property of CloudPageBlob to 2MB.
CloudStorageAccount storageAccount = CloudStorageAccount.Parse(
CloudConfigurationManager.GetSetting("StorageConnectionString"));
CloudBlobClient blobClient = storageAccount.CreateCloudBlobClient();
CloudBlobContainer container = blobClient.GetContainerReference("mycontainername");
container.CreateIfNotExists();
CloudPageBlob pageBlob = container.GetPageBlobReference("mypageblob");
pageBlob.StreamMinimumReadSizeInBytes = 2 * 1024 * 1024;
Task<Stream> pageAlignedEventStream = pageBlob.OpenReadAsync();
The read fails to pull additional bytes before returning and only reads 503 bytes instead of the expected 3575 bytes.
Stream.Read is not guaranteed to fill the buffer: the return value can be less than the number of bytes requested if that many bytes are not currently available, and is zero once the end of the stream has been reached, so short reads must be retried in a loop. Please also debug your code to trace the changes of the variable paddedSpaceToSkip and check whether your code logic is as expected.

optimization of sequential i/o operations on large file sizes

Compiler: Microsoft C++ 2005. Hardware: AMD 64-bit (16 GB).
Sequential, read-only access from an 18GB file is committed with the following timing, file access, and file structure characteristics:
18,184,359,164 (file length)
11,240,476,672 (NTFS compressed file length)
Time File Method Disk
14:33? compressed fstream fixed disk
14:06 normal fstream fixed disk
12:22 normal winapi fixed disk
11:47 compressed winapi fixed disk
11:29 compressed fstream ram disk
10:37 compressed winapi ram disk
7:18 compressed 7z stored decompression to ntfs 12gb ram disk
6:37 normal copy to same volume fixed disk
The fstream constructor and access:
#define BUFFERSIZE 524288
unsigned int mbytes = BUFFERSIZE;
char *databuffer0 = (char *)malloc(mbytes);
datafile.open("drv:/file.ext", ios::in | ios::binary);
datafile.read(databuffer0, mbytes);
The winapi constructor and access:
#define BUFFERSIZE 524288
unsigned int mbytes = BUFFERSIZE;
const TCHAR* const filex = _T("drv:/file.ext");
char ReadBuffer[BUFFERSIZE] = {0};
hFile = CreateFile(filex, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
if (FALSE == ReadFile(hFile, ReadBuffer, BUFFERSIZE - 1, &dwBytesRead, NULL))
{ ...
For the fstream method, buffer sizes up to 16MB do not decrease processing time. All buffer sizes beyond 0.5MB fail for the winapi method. What methods would optimize this implementation versus processing time?
Did you try memory-mapping the file? In my test this was always the fastest way to read large files.
Update: Here's an old, but still accurate description of memory mapped files:
http://msdn.microsoft.com/en-us/library/ms810613.aspx
Try this.
hf = CreateFile(..... FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED ...)
Then the reading loop. Minor details omitted as typing on iPad...
int bufsize = 4*1024*1024;
CEvent e1;
CEvent e2;
CEvent e3;
CEvent e4;
unsigned char* pbuffer1 = malloc(bufsize);
unsigned char* pbuffer2 = malloc(bufsize);
unsigned char* pbuffer3 = malloc(bufsize);
unsigned char* pbuffer4 = malloc(bufsize);
int CurOffset = 0;
do {
    OVERLAPPED r1;
    memset(&r1, 0, sizeof(OVERLAPPED));
    r1.Offset = CurOffset;
    CurOffset += bufsize;
    r1.hEvent = e1;
    if (!ReadFile(hf, pbuffer1, bufsize, NULL, &r1)) {
        // check for error AND ERROR_HANDLE_EOF (important)
    }
    OVERLAPPED r2;
    memset(&r2, 0, sizeof(OVERLAPPED));
    r2.Offset = CurOffset;
    CurOffset += bufsize;
    r2.hEvent = e2;
    if (!ReadFile(hf, pbuffer2, bufsize, NULL, &r2)) {
        // check for error AND ERROR_HANDLE_EOF (important)
    }
    OVERLAPPED r3;
    memset(&r3, 0, sizeof(OVERLAPPED));
    r3.Offset = CurOffset;
    CurOffset += bufsize;
    r3.hEvent = e3;
    if (!ReadFile(hf, pbuffer3, bufsize, NULL, &r3)) {
        // check for error AND ERROR_HANDLE_EOF (important)
    }
    OVERLAPPED r4;
    memset(&r4, 0, sizeof(OVERLAPPED));
    r4.Offset = CurOffset;
    CurOffset += bufsize;
    r4.hEvent = e4;
    if (!ReadFile(hf, pbuffer4, bufsize, NULL, &r4)) {
        // check for error AND ERROR_HANDLE_EOF (important)
    }
    // wait for events to indicate data present
    // send data to consuming threads
    // allocate new buffers
} while ( not eof, etc )
The above is the bones of what you need. We use this and achieve high I/O throughput rates, but you may need to tune it slightly to reach ultimate performance. We found 4 outstanding I/Os was best for our use, but this will vary by platform. Reading less than 1MB per I/O hurt performance. Once a buffer has been read, don't try to consume it in the reading loop: post it to another thread and allocate another buffer (but get buffers from a reuse queue, don't keep calling malloc). The overall intent of the above is to keep 4 outstanding I/Os open to the disk; as soon as you don't have this, overall performance will drop.
Also, this works best on a disk that is only reading your file. If you start reading/writing different files on the same disk at the same time, performance drops quickly, unless you have SSDs!
Not sure why your ReadFile is failing for 0.5MB buffers; I just double-checked and our live prod code is using 4MB buffers.
