Coding for Input/Output: Speed or Memory priority?


I am currently writing a simple piece of I/O parsing code, and I am in a dilemma as to how I should code it.
This is the case of a web application, where this particular parsing function may be called multiple times within a second by several users.
Assume that the file size is more than 2 MB and Hardware IO delays are 5ms for each call.
First Case: Memory
The first case would be to code for memory, at the expense of speed. The function will read the file in small parts and parse it part by part, thus using more iterations but less memory.
Pseudo-code:
function parser() {
    Open file and put into handle variable fHandle
    while (file position not past EOF) {
        read 1024 bytes from file using fHandle into variable data
        process(data)
    }
    Close file using handle fHandle
}
Second Case: Speed
The second case would be to code for speed, at the expense of memory usage. The function will load the entire file content into memory and parse it directly.
Pseudo-code:
function parser() {
    read entire file and store into variable data
    declare parsing position variable and set to 0
    while (parsing position not past data length) {
        get position of next token and store into variable pos
        process( substring from current position to pos of data )
    }
}
Note: when reading the entire file, we use functions directly available in the library. No loops are used to read the file on the developer's end.
Third Case: End-user choice
Would it then be advisable to write both, and have the function detect at run time whether memory is abundant? If there is a lot of free memory, the function will use the memory-intensive version.
Pseudo-code:
function parser() {
    if (memory is too little) {
        Open file and put into handle variable fHandle
        while (file position not past EOF) {
            read 1024 bytes from file using fHandle into variable data
            process(data)
        }
        Close file using handle fHandle
    } else {
        read entire file and store into variable data
        declare parsing position variable and set to 0
        while (parsing position not past data length) {
            get position of next token and store into variable pos
            process( substring from current position to pos of data )
        }
    }
}

Use asynchronous I/O (or a second thread), and process one chunk of data while the drive's busy fetching the next chunk. Best of both worlds.
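As a rough sketch of the second-thread idea in Go: one goroutine reads fixed-size chunks while the caller processes the previous one. The names readChunks and parsePipelined are made up for illustration, a string stands in for the file, and counting bytes stands in for the real parsing.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// readChunks reads r in chunkSize pieces on its own goroutine and sends each
// piece on the returned channel. The one-slot channel buffer lets the reader
// stay one chunk ahead of the consumer.
func readChunks(r io.Reader, chunkSize int) (<-chan []byte, <-chan error) {
	chunks := make(chan []byte, 1)
	errc := make(chan error, 1)
	go func() {
		defer close(chunks)
		for {
			buf := make([]byte, chunkSize)
			n, err := io.ReadFull(r, buf)
			if n > 0 {
				chunks <- buf[:n]
			}
			if err == io.EOF || err == io.ErrUnexpectedEOF {
				errc <- nil // normal end of input
				return
			}
			if err != nil {
				errc <- err
				return
			}
		}
	}()
	return chunks, errc
}

// parsePipelined is a hypothetical stand-in for the parser: it only counts
// the chunks and bytes it receives while the reader fetches the next chunk.
func parsePipelined(input string, chunkSize int) (int, int, error) {
	chunks, errc := readChunks(strings.NewReader(input), chunkSize)
	count, total := 0, 0
	for c := range chunks {
		count++
		total += len(c)
	}
	return count, total, <-errc
}

func main() {
	count, total, err := parsePipelined(strings.Repeat("x", 2500), 1024)
	fmt.Println(count, total, err)
}
```

Memory use stays bounded at a couple of chunks, and the read latency is hidden only to the extent that processing a chunk takes about as long as fetching one.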

If you need to read the full file either way and it fits into memory without issue, then read it from memory. Will it be the same file every time, or some small set of files? Cache them in memory.
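A minimal sketch of such a cache in Go, assuming the set of files is small enough to keep in memory. The type fileCache and its loads counter are hypothetical names, and the loader is injectable only so the behavior is easy to check.

```go
package main

import (
	"fmt"
	"os"
	"sync"
)

// fileCache keeps the contents of already-read files in memory.
type fileCache struct {
	mu    sync.Mutex
	load  func(path string) ([]byte, error)
	data  map[string][]byte
	loads int // number of times the loader actually ran
}

func newFileCache(load func(string) ([]byte, error)) *fileCache {
	return &fileCache{load: load, data: make(map[string][]byte)}
}

// get returns the cached contents for path, loading them on first use.
// Callers must treat the returned slice as read-only, since it is shared.
func (c *fileCache) get(path string) ([]byte, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if b, ok := c.data[path]; ok {
		return b, nil
	}
	b, err := c.load(path)
	if err != nil {
		return nil, err
	}
	c.data[path] = b
	c.loads++
	return b, nil
}

func main() {
	// Create a small temporary file so the example has something to read.
	f, err := os.CreateTemp("", "parse-input-*.txt")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer os.Remove(f.Name())
	f.WriteString("some file contents")
	f.Close()

	cache := newFileCache(os.ReadFile)
	for i := 0; i < 3; i++ {
		b, err := cache.get(f.Name())
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Println(len(b), "bytes; disk reads so far:", cache.loads)
	}
}
```

This sketch holds the lock while loading, which serializes first-time reads, and it never invalidates entries, so it only fits files that do not change while the application runs.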

If the input for your parsing comes from I/O, as it usually does, any good parsing technology, like recursive-descent, will be I/O bound.
In other words, the average time to get a character from the I/O should exceed the average time spent processing it, by a healthy factor.
So it really doesn't matter very much.
The only difference will be in how much working storage you glom onto, which is not usually a big deal.

Related

Go performance penalty in high number of calls to append

I'm writing an emulator in Go, and for debugging purposes I'm logging the CPU's state at every emulator cycle to generate a log file later.
I'm doing something wrong, because when the logger is enabled, performance drops and the emulator becomes unusable.
The profiler clearly shows that the culprit is the logging routine (the logStep method):
The logStep method is very simple: it calls CreateState to snapshot the current CPU state in a struct, and then adds it to a slice (in the Log method).
I call this method at every emulated CPU cycle (around 30,000 times per second), and I suspect that either the garbage collector is slowing my execution or I'm doing something wrong with this data structure.
I understand that the profile graph points to runtime.growslice, caused by an append located in (*cpu6502Logger).Log, but I'm unable to find information on how to do this more efficiently.
I'm also scratching my head over why CreateState takes that long just to create a simple struct.
This is what CpuState looks like:
type CpuState struct {
	Registers          Cpu6502Registers
	CurrentInstruction Instruction
	RawOpcode          [4]byte
	EvaluatedAddress   Address
	CyclesSinceReset   uint32
}
This is how I create a CPU Snapshot:
func CreateState(cpu Cpu6502) CpuState {
	pc := cpu.Registers().Pc
	var rawOpcode [4]byte
	rawOpcode[0] = 0x00
	pc++
	instruction := cpu.instructions[rawOpcode[0]]
	for i := byte(0); i < (instruction.Size() - 1); i++ {
		rawOpcode[1+i] = cpu.memory.Read(pc + Address(i))
	}
	_, evaluatedAddress, _, _ := cpu.addressEvaluators[instruction.AddressMode()](pc)
	state := CpuState{
		*cpu.Registers(),
		instruction,
		rawOpcode,
		evaluatedAddress,
		cpu.cycle,
	}
	return state
}
And finally, here is how I add this snapshot to a collection (the Log method in the profile graph). I've also added how I initialize logger.snapshots:
func createCPULogger(outputPath string) cpu6502Logger {
	return cpu6502Logger{
		outputPath: outputPath,
		snapshots:  make([]CpuState, 0, 10024),
	}
}

func (logger *cpu6502Logger) Log(state CpuState) {
	logger.snapshots = append(logger.snapshots, state)
}

Why is it slow
Maintaining one gigantic slice that holds all the data is very costly, mainly because it keeps growing. Whenever an append exceeds the slice's capacity, the whole backing array is copied into a bigger one to allow expansion. As the slice grows, each reallocation copies more data and gets slower and slower. You told us that you emulate thousands of CPU states per second.
Solution
The best way to deal with this is to allocate a fixed buffer of some length. We know that eventually we will run out of space. When that happens, we have two options. First, you can write all the data from the buffer to a file, then truncate the buffer and start filling it again (then write again). The other option is to save the filled buffers in a slice and allocate a new one. Choose whichever fits your machine (slow or small RAM is not good for the second solution).
Why does this help
I think this also helps the emulator itself. There will be performance spikes when the buffer is flushed or replaced, but most of the time performance will be at its maximum. Allocating a big block of memory is slow, as the allocator is less likely to find a fitting section on the first try. The garbage collector is also very unhappy with frequent allocations. By allocating a buffer and filling it, we use one big (but not too big) allocation and store the data in sections. Sections we have already saved can stay where they are. We can also say that in this case we are handling memory ourselves more than the GC does (no garbage memory is produced).
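The first option (fill a fixed buffer, flush it, reuse it) could look roughly like the sketch below. The state struct is a simplified, hypothetical stand-in for CpuState, and the flush callback is an assumption: in the emulator it would write the batch to the log file.

```go
package main

import "fmt"

// state is a hypothetical, simplified stand-in for the question's CpuState.
type state struct {
	PC    uint16
	Cycle uint32
}

// bufferedLogger collects states in a fixed-capacity buffer and hands the
// whole batch to flush when the buffer is full. The buffer is allocated once
// and reused, so append never has to grow it.
type bufferedLogger struct {
	buf   []state
	flush func(batch []state) error
}

// newBufferedLogger expects capacity to be greater than zero.
func newBufferedLogger(capacity int, flush func([]state) error) *bufferedLogger {
	return &bufferedLogger{buf: make([]state, 0, capacity), flush: flush}
}

// Log stores one state, flushing first if the buffer is already full.
func (l *bufferedLogger) Log(s state) error {
	if len(l.buf) == cap(l.buf) {
		if err := l.Flush(); err != nil {
			return err
		}
	}
	l.buf = append(l.buf, s) // len < cap here, so no reallocation
	return nil
}

// Flush passes the pending states to the callback and empties the buffer.
// The callback must not keep the slice, because its memory is reused.
func (l *bufferedLogger) Flush() error {
	if len(l.buf) == 0 {
		return nil
	}
	err := l.flush(l.buf)
	l.buf = l.buf[:0]
	return err
}

func main() {
	logger := newBufferedLogger(4, func(batch []state) error {
		fmt.Println("flushing", len(batch), "states")
		return nil
	})
	for i := 0; i < 10; i++ {
		if err := logger.Log(state{Cycle: uint32(i)}); err != nil {
			fmt.Println("error:", err)
			return
		}
	}
	logger.Flush() // write out whatever is left at the end
}
```

The capacity trades memory for flush frequency: a larger buffer means fewer but longer pauses.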

Using SetFilePointer to change the location to write in the sector doesn't work?

I'm using SetFilePointer to rewrite the second half of the MBR with something. It's a user-mode application, and I opened a handle to PhysicalDrive.
At first I tried to set the size parameter in WriteFile to 256, but WriteFile failed with the INVALID_PARAMETER error. Based on other questions here, it seems this is because we are forced to write in multiples of the sector size when the handle is to PhysicalDrive, for some reason.
Then I tried to set the file pointer to 256 and write 512 bytes. Neither call returns an error, but for some unknown reason it writes from the beginning of the sector, as if SetFilePointer didn't work, even though the return value of SetFilePointer is OK and it returns 256.
So my questions are:
Why does the write size have to be a multiple of the sector size when the handle is to PhysicalDrive? Which other device handles are like this?
Why is this happening, and why does WriteFile still write from the start when I set the file pointer to 256?
Isn't this really redundant? Even if I want to change 1 byte, I have to read the entire sector, change the one byte, and then write it back, instead of just writing 1 byte. It seems like 10 times more overhead! Isn't there a faster way to write a few bytes in a sector?

I think you are mixing up the file system and the storage (block device). The file system sits above the storage device stack. If your code obtains a handle to a file system device, you can write byte by byte. But if you are accessing the storage device stack, you can only write sector by sector (or in units of the block size).
Writing directly to a block device is definitely slow, as you discovered. However, in most cases, people just talk to file systems. Most file system drivers maintain a cache and use algorithms for both reads and writes to improve performance.
I can't comment on the file-pointer-based offset before seeing the actual code. But I guess it might not be sector aligned, or it's not used at all.
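The read-modify-write cycle the question describes can be sketched in Go as follows. The sectorDevice type is a hypothetical in-memory stand-in that rejects unaligned I/O the way the raw drive handle does; it is not the Windows API, and patchBytes is a made-up helper name.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

var errUnaligned = errors.New("offset and length must be multiples of the sector size")

// sectorDevice is a hypothetical in-memory stand-in for a raw disk handle:
// it rejects any I/O that is not sector-aligned.
type sectorDevice struct {
	data       []byte
	sectorSize int64
}

func (d *sectorDevice) check(n int, off int64) error {
	if off < 0 || off%d.sectorSize != 0 || int64(n)%d.sectorSize != 0 {
		return errUnaligned
	}
	if off+int64(n) > int64(len(d.data)) {
		return io.ErrUnexpectedEOF
	}
	return nil
}

func (d *sectorDevice) ReadAt(p []byte, off int64) (int, error) {
	if err := d.check(len(p), off); err != nil {
		return 0, err
	}
	return copy(p, d.data[off:]), nil
}

func (d *sectorDevice) WriteAt(p []byte, off int64) (int, error) {
	if err := d.check(len(p), off); err != nil {
		return 0, err
	}
	return copy(d.data[off:], p), nil
}

// blockDevice is the minimal interface patchBytes needs.
type blockDevice interface {
	io.ReaderAt
	io.WriterAt
}

// patchBytes changes len(patch) bytes at byte offset off by reading the
// enclosing sectors, modifying them in memory, and writing them back whole.
func patchBytes(dev blockDevice, sectorSize, off int64, patch []byte) error {
	start := off / sectorSize * sectorSize
	end := (off + int64(len(patch)) + sectorSize - 1) / sectorSize * sectorSize
	buf := make([]byte, end-start)
	if _, err := dev.ReadAt(buf, start); err != nil {
		return err
	}
	copy(buf[off-start:], patch)
	_, err := dev.WriteAt(buf, start)
	return err
}

func main() {
	dev := &sectorDevice{data: make([]byte, 1024), sectorSize: 512}
	_, err := dev.WriteAt([]byte("half"), 256)
	fmt.Println("direct unaligned write:", err)
	err = patchBytes(dev, 512, 256, []byte("half"))
	fmt.Println("read-modify-write:", err, string(dev.data[256:260]))
}
```

Both the read and the write cover whole sectors starting at an aligned offset, which is the constraint the answer describes for the storage device stack.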

How to handle for loop with large objects in Rstudio?

I have a for loop that works with large objects. From trial and error, I can only load the large object once. If I load the object again, I get the error "Error: cannot allocate vector of size *** Mb". I tried to overcome this issue by removing the object at the end of each iteration. However, I still get the error "Error: cannot allocate vector of size 699.2 Mb" at the beginning of the second iteration.
My for loop has the following structure:
for (i in 1:22) {
    VeryLargeObject <- ...i...
    ...
    .
    .
    .
    ...
    rm(VeryLargeObject)
}
Each VeryLargeObject is 2-3 GB. My PC has 16 GB of RAM, 8 cores, and 64-bit Windows 10.
Any solution on how I can manage to complete the for loop?

The error "cannot allocate..." likely comes from the fact that rm() does not immediately free memory. So the first object still occupies RAM when you load the second one. Objects that are not assigned to any name (variable) anymore get garbage collected by R at time points that R decides for itself.
Most remedies come from not loading the entire object into RAM:
If you are working with a matrix, create a filebacked.big.matrix() with the bigmemory package. Write your data into this object using var[...,...] syntax like a normal matrix. Then, in a new R session (and a new R script to preserve reproducibility), you can load this matrix from disk and modify it.
The mmap package uses a similar approach, relying on your operating system's ability to map RAM pages to disk. The data appears to a program as if it were in RAM, but is read from disk. To improve speed, the operating system takes care of keeping the relevant parts in RAM.
If you work with data frames, you can use packages like fst and feather that enable you to load only parts of your data frame into a variable.
Transfer your data frame into a database like SQLite and then access the database from R. The package dbplyr enables you to treat a database as a tidyverse-style data set. You can also use raw SQL commands with the package DBI.
Another approach is to not write interactively, but to write an R script that processes only one of your objects:
Write an R script, named, say processBigObject.R that gets the file name of your big object from the command line using commandArgs():
#!/usr/bin/env Rscript
#
# Process a big object
#
# Usage: Rscript processBigObject.R <INPUT_FILENAME> <OUTPUT_FILENAME>
args <- commandArgs(trailingOnly = TRUE)
input_filename <- args[1]
output_filename <- args[2]
# I'm making up function names here, do what you must for your object
o <- readBigObject(input_filename)
s <- calculateSmallerSummaryOf(o)
writeOutput(s, output_filename)
Then, write a shell script or use system2() to call the script multiple times, with different file names. Because R is terminated after each object, the memory is freed:
system2("Rscript", c("processBigObject.R", "bigObject1.dat", "bigObject1_result.dat"))
system2("Rscript", c("processBigObject.R", "bigObject2.dat", "bigObject2_result.dat"))
system2("Rscript", c("processBigObject.R", "bigObject3.dat", "bigObject3_result.dat"))
...

Windows (ReFS,NTFS) file preallocation hint

Assume I have multiple processes writing large files (20 GB+). Each process writes its own file, x MB at a time: it writes x MB, does some processing, writes x MB again, and so on.
This write pattern causes the files to be heavily fragmented, since blocks are allocated consecutively on the disk as they are written, so blocks from different files end up interleaved.
Of course, it is easy to work around this issue by using SetEndOfFile to "preallocate" the file when it is opened and then setting the correct size before it is closed. But then an application that accesses these files remotely, and is able to parse these in-progress files, sees zeros at the end of the file and takes much longer to parse it.
I do not have control over this reading application, so I can't optimize it to take the zeros at the end into account.
Another dirty fix would be to run defragmentation more often, run the Sysinternals Contig utility, or even implement a custom "defragmenter" that would process my files and consolidate their blocks.
Another, more drastic solution would be to implement a minifilter driver that would report a "fake" file size.
But both solutions listed above are far from optimal. So I would like to know if there is a way to provide a file size hint to the file system so that it "reserves" consecutive space on the drive but still reports the right file size to applications.
Otherwise, writing larger chunks at a time also helps with fragmentation, but still does not solve the issue.
EDIT:
Since the usefulness of SetEndOfFile in my case seems to be disputed I made a small test:
LARGE_INTEGER size;
LARGE_INTEGER a;
char buf='A';
DWORD written=0;
DWORD tstart;
std::cout << "creating file\n";
tstart = GetTickCount();
HANDLE f = CreateFileA("e:\\test.dat", GENERIC_ALL, FILE_SHARE_READ, NULL, CREATE_ALWAYS, 0, NULL);
size.QuadPart = 100000000LL;
SetFilePointerEx(f, size, &a, FILE_BEGIN);
SetEndOfFile(f);
printf("file extended, elapsed: %d\n",GetTickCount()-tstart);
getchar();
printf("writing 'A' at the end\n");
tstart = GetTickCount();
SetFilePointer(f, -1, NULL, FILE_END);
WriteFile(f, &buf,1,&written,NULL);
printf("written: %d bytes, elapsed: %d\n",written,GetTickCount()-tstart);
While the application was waiting for a keypress after SetEndOfFile, I examined the on-disk NTFS structures:
The image shows that NTFS has indeed allocated clusters for my file. However, the unnamed DATA attribute has StreamDataSize specified as 0.
Sysinternals DiskView also confirms that clusters were allocated.
After I pressed Enter to allow the test to continue (and waited for quite some time, since the file was created on a slow USB stick), the StreamDataSize field was updated.
Since I wrote 1 byte at the end, NTFS now really had to zero everything, so SetEndOfFile does indeed help with the issue that I am "fretting" about.
I would appreciate it very much that answers/comments also provide an official reference to back up the claims being made.
Oh and the test application outputs this in my case:
creating file
file extended, elapsed: 0
writing 'A' at the end
written: 1 bytes, elapsed: 21735
Also, for the sake of completeness, here is an example of how the DATA attribute looks when setting FileAllocationInfo (note that I created a new file for this picture).

Windows file systems maintain two public sizes for file data, which are reported in the FileStandardInformation:
AllocationSize - a file's allocation size in bytes, which is typically a multiple of the sector or cluster size.
EndOfFile - a file's absolute end of file position as a byte offset from the start of the file, which must be less than or equal to the allocation size.
Setting an end of file that exceeds the current allocation size implicitly extends the allocation. Setting an allocation size that's less than the current end of file implicitly truncates the end of file.
Starting with Windows Vista, we can manually extend the allocation size without modifying the end of file via SetFileInformationByHandle: FileAllocationInfo. You can use Sysinternals DiskView to verify that this allocates clusters for the file. When the file is closed, the allocation gets truncated to the current end of file.
If you don't mind using the NT API directly, you can also call NtSetInformationFile: FileAllocationInformation. Or even set the allocation size at creation via NtCreateFile.
FYI, there's also an internal ValidDataLength size, which must be less than or equal to the end of file. As a file grows, the clusters on disk are lazily initialized. Reading beyond the valid region returns zeros. Writing beyond the valid region extends it by initializing all clusters up to the write offset with zeros. This is typically where we might observe a performance cost when extending a file with random writes. We can set the FileValidDataLengthInformation to get around this (e.g. SetFileValidData), but it exposes uninitialized disk data and thus requires SeManageVolumePrivilege. An application that utilizes this feature should take care to open the file exclusively and ensure the file is secure in case the application or system crashes.

Why does the memory mapped file ever need to be flushed when access is RDWR?

I was reading through one of the Go implementations of memory-mapped files, https://github.com/edsrzf/mmap-go/. First, the author describes the access modes:
// RDONLY maps the memory read-only.
// Attempts to write to the MMap object will result in undefined behavior.
RDONLY = 0
// RDWR maps the memory as read-write. Writes to the MMap object will update the
// underlying file.
RDWR = 1 << iota
// COPY maps the memory as copy-on-write. Writes to the MMap object will affect
// memory, but the underlying file will remain unchanged.
COPY
But in the gommap test file I see this:
func TestReadWrite(t *testing.T) {
	mmap, err := Map(f, RDWR, 0)
	... omitted for brevity...
	mmap[9] = 'X'
	mmap.Flush()
So why does he need to call Flush to make sure the contents are written to the file if the access mode is RDWR?
Or is the OS managing this so it only writes when it thinks it should?
If it is the latter, could you please explain it in a little more detail? What I read is that when the OS is low on memory, it writes to the file and frees up memory. Is this correct, and does it apply only to RDWR or only to COPY?
Thanks

The program maps a region of memory using mmap. It then modifies the mapped region. The system isn't required to write those modifications back to the underlying file immediately, so a read call on that file (in ioutil.ReadAll) could return the prior contents of the file.
The system will write the changes to the file at some point after you make the changes.
It is allowed to write the changes to the file any time after the changes are made, but by default makes no guarantees about when it writes those changes. All you know is that (unless the system crashes), the changes will be written at some point in the future.
If you need to guarantee that the changes have been written to the file at some point in time, then you must call msync.
The mmap.Flush function calls msync with the MS_SYNC flag. When that system call returns, the system has written the modifications to the underlying file, so that any subsequent call to read will read the modified file.
The COPY option sets the mapping to MAP_PRIVATE, so your changes will never be written back to the file, even if you use msync (via the Flush function).
Read the POSIX documentation about mmap and msync for full details.
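To illustrate the sequence without the mmap-go wrapper, here is a sketch using Go's syscall package directly. It assumes Linux or macOS, the names setByteAndSync and demo are made up, and the temporary file exists only for the demonstration.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
	"unsafe"
)

// setByteAndSync maps the file read-write and shared, changes one byte, and
// calls msync(MS_SYNC) so the change is written to the file before returning.
func setByteAndSync(path string, index int, value byte) error {
	f, err := os.OpenFile(path, os.O_RDWR, 0)
	if err != nil {
		return err
	}
	defer f.Close()
	info, err := f.Stat()
	if err != nil {
		return err
	}
	if index < 0 || int64(index) >= info.Size() {
		return fmt.Errorf("index %d outside file of %d bytes", index, info.Size())
	}
	mem, err := syscall.Mmap(int(f.Fd()), 0, int(info.Size()),
		syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED)
	if err != nil {
		return err
	}
	defer syscall.Munmap(mem)

	mem[index] = value // modifies the mapping; the file is updated later

	// Without this call, the system decides when to write the page back.
	_, _, errno := syscall.Syscall(syscall.SYS_MSYNC,
		uintptr(unsafe.Pointer(&mem[0])), uintptr(len(mem)), uintptr(syscall.MS_SYNC))
	if errno != 0 {
		return errno
	}
	return nil
}

// demo writes a small temporary file, patches byte 9 through the mapping,
// and reads the file back with an ordinary read.
func demo() (string, error) {
	f, err := os.CreateTemp("", "mmap-demo-*")
	if err != nil {
		return "", err
	}
	defer os.Remove(f.Name())
	if _, err := f.WriteString("hello world"); err != nil {
		f.Close()
		return "", err
	}
	if err := f.Close(); err != nil {
		return "", err
	}
	if err := setByteAndSync(f.Name(), 9, 'X'); err != nil {
		return "", err
	}
	b, err := os.ReadFile(f.Name())
	return string(b), err
}

func main() {
	contents, err := demo()
	fmt.Println(contents, err)
}
```

Replacing MAP_SHARED with MAP_PRIVATE would correspond to the COPY mode: the in-memory byte would change, but the file read back at the end would not.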
