Write operation cost - performance

I have a Go program which writes strings into a file. I have a loop which is iterated 20000 times, and in each iteration I write around 20-30 strings into the file. I want to know the best way to write them.
Approach 1: Open the file at the start of the code and write each string to it as it is produced. That makes 20000*30 write operations.
Approach 2: Use a bytes.Buffer, store everything in the buffer, and write it to the file at the end. In this case, should the file be opened at the beginning of the code or at the end? Does it matter?
I am assuming approach 2 should work better. Can someone confirm this and give a reason? Why is writing everything at once better than writing periodically, given that the file is open anyway?
I am using f.WriteString(<string>) and buffer.WriteString(<some string>), where buffer is of type bytes.Buffer and f is the open file.

The bufio package was created exactly for this kind of task. Instead of making a syscall for each Write call, bufio.Writer buffers up to a fixed number of bytes in internal memory before making a syscall. After a syscall the internal buffer is reused for the next portion of data.
Compared to your second approach, bufio.Writer makes more syscalls (N/S instead of 1) but uses less memory (S bytes instead of N bytes), where S is the buffer size (can be specified via bufio.NewWriterSize) and N is the total size of the data that needs to be written.
Example usage (https://play.golang.org/p/AvBE1d6wpT):
f, err := os.Create("file.txt")
if err != nil {
    panic(err)
}
defer f.Close()
w := bufio.NewWriter(f)
fmt.Fprint(w, "Hello, ")
fmt.Fprint(w, "world!")
err = w.Flush() // Don't forget to flush!
if err != nil {
    panic(err)
}
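To make the N/S versus N comparison concrete, here is a minimal sketch that counts how often the underlying writer is actually called. The countingWriter type and the countWrites helper are made up for this illustration; countingWriter stands in for the file, so each of its Write calls corresponds to what would be one write syscall.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
)

// countingWriter stands in for the file: it discards the data but counts
// how many times Write is called (each call would be one write syscall).
type countingWriter struct {
	calls int
}

func (c *countingWriter) Write(p []byte) (int, error) {
	c.calls++
	return len(p), nil
}

// writeAll writes the same 12-byte string total times to w.
func writeAll(w io.Writer, total int) error {
	for i := 0; i < total; i++ {
		if _, err := io.WriteString(w, "some string\n"); err != nil {
			return err
		}
	}
	return nil
}

// countWrites returns the number of underlying Write calls made when
// writing directly and when going through a bufio.Writer of bufSize bytes.
func countWrites(total, bufSize int) (direct, buffered int, err error) {
	d := &countingWriter{}
	if err = writeAll(d, total); err != nil {
		return 0, 0, err
	}

	b := &countingWriter{}
	bw := bufio.NewWriterSize(b, bufSize)
	if err = writeAll(bw, total); err != nil {
		return 0, 0, err
	}
	if err = bw.Flush(); err != nil { // the last partial buffer still has to go out
		return 0, 0, err
	}
	return d.calls, b.calls, nil
}

func main() {
	direct, buffered, err := countWrites(20000*30, 4096)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("direct writes:  ", direct)
	fmt.Println("buffered writes:", buffered)
}
```

The buffered count should come out close to the total number of bytes divided by the buffer size, which is the N/S figure from above.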

The operations that take time when writing in files are the syscalls and the disk I/O. The fact that the file pointer is open doesn't cost you anything. So naively, we could say that the second method is best.
Now, as you may know, your OS doesn't write directly into files; it uses an internal in-memory cache for written files and does the real I/O later. I don't know the exact details of that, and generally speaking I don't need to.
What I would advise is a middle-ground solution: fill a buffer during each loop iteration, and write that buffer out once per iteration. That way you cut a big part of the number of syscalls and (potentially) disk writes, but without consuming too much memory with the buffer (depending on the size of your strings, that may be a point to take into account).
I would suggest benchmarking for the best solution, but due to the caching done by the system, benchmarking disk I/O is a real nightmare.
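A rough sketch of that middle ground, assuming the strings of one iteration fit comfortably in memory. The writeBatched function and the countingWriter type are hypothetical names used only here; in a real program w would be the *os.File.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// countingWriter stands in for the file and records what it receives.
type countingWriter struct {
	calls int
	total int
}

func (c *countingWriter) Write(p []byte) (int, error) {
	c.calls++
	c.total += len(p)
	return len(p), nil
}

// writeBatched assembles the strings of each iteration in a reusable
// bytes.Buffer and hands them to w with a single Write per iteration.
func writeBatched(w io.Writer, iterations, perIteration int) error {
	var buf bytes.Buffer
	for i := 0; i < iterations; i++ {
		buf.Reset() // keep the allocated memory, drop the old contents
		for j := 0; j < perIteration; j++ {
			fmt.Fprintf(&buf, "iteration %d, string %d\n", i, j)
		}
		if _, err := w.Write(buf.Bytes()); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	w := &countingWriter{}
	if err := writeBatched(w, 20000, 30); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%d writes, %d bytes\n", w.calls, w.total)
}
```

The buffer never grows beyond the size of the largest iteration, and the number of writes equals the number of iterations rather than the number of strings.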

Syscalls are not cheap, so the second approach is better.
You can use lat_syscall tool from lmbench to measure how long it takes to call single write:
$ ./lat_syscall write
Simple write: 0.1522 microseconds
So, on my system it will take approximately 20000 * 30 * 0.15μs = 90ms extra time just to call write for every string (or 20000 * 0.15μs = 3ms with only one write per iteration).
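The estimate scales linearly with the number of calls, so it is easy to redo for other loop shapes. A small sketch, assuming the measured latency is rounded to 152ns and using a made-up helper name syscallOverhead:

```go
package main

import (
	"fmt"
	"time"
)

// syscallOverhead estimates the time spent purely on entering the kernel
// for calls write calls that each cost perCall.
func syscallOverhead(calls int, perCall time.Duration) time.Duration {
	return time.Duration(calls) * perCall
}

func main() {
	perCall := 152 * time.Nanosecond // 0.1522 microseconds, rounded
	fmt.Println("one write per iteration:", syscallOverhead(20000, perCall))
	fmt.Println("one write per string:   ", syscallOverhead(20000*30, perCall))
}
```

This only covers the cost of the call itself, not the copying or the disk I/O behind it.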


How can I read the received packets with a NDIS filter driver?

I am currently experimenting with the NDIS driver samples.
I am trying to print the packets contents (including the MAC-addresses, EtherType and the data).
My first guess was to implement this in the function FilterReceiveNetBufferLists. Unfortunately I am not sure how to extract the packets contents out of the NetBufferLists.
That's the right place to start. Consider this code:
void FilterReceiveNetBufferLists(..., NET_BUFFER_LIST *nblChain, ...)
{
    UCHAR buffer[14];
    UCHAR *header;

    for (NET_BUFFER_LIST *nbl = nblChain; nbl; nbl = nbl->Next) {
        /* Ask for the 14-byte Ethernet header; buffer is only used if the bytes straddle MDLs. */
        header = NdisGetDataBuffer(nbl->FirstNetBuffer, sizeof(buffer), buffer, 1, 0);
        if (!header)
            continue; /* fewer than 14 bytes available in this frame */

        DbgPrint("MAC address: %02x-%02x-%02x-%02x-%02x-%02x\n",
            header[0], header[1], header[2],
            header[3], header[4], header[5]);
    }

    NdisFIndicateReceiveNetBufferLists(..., nblChain, ...);
}
There are a few points to consider about this code.
The NDIS datapath uses the NET_BUFFER_LIST (nbl) as its primary data structure. An nbl represents a set of packets that all have the same metadata. For the receive path, nobody really knows much about the metadata, so that set always has exactly 1 packet in it. In other words, the nbl is a list... of length 1. For the receive path, you can count on it.
The nbl is a list of one or more NET_BUFFER (nb) structures. An nb represents a single network frame (subject to LSO or RSC). So the nb corresponds most closely to what you think of as a packet. Its metadata is stored on the nbl that contains it.
Within an nb, the actual packet payload is stored as one or more buffers, each represented as an MDL. Mentally, you should pretend the MDLs are just concatenated together. For example, the network headers might be in one MDL, while the rest of the payload might be in another MDL.
Finally, for performance, NDIS gives as many NBLs to your LWF as possible. This means there's a list of one or more NBLs.
Put it all together, and you have:
Your function receives a list of NBLs.
Each NBL contains exactly 1 NB (on the receive path).
Each NB contains a list of MDLs.
Each MDL points to a buffer of payload.
So in our example code above, the for-loop iterates along that first bullet point: the chain of NBLs. Within the loop, we only need to look at nbl->FirstNetBuffer, since we can safely assume there is no other nb besides the first.
It's inconvenient to have to fiddle with all those MDLs directly, so we use the helper routine NdisGetDataBuffer. You tell this guy how many bytes of payload you want to see, and he'll give you a pointer to a contiguous range of payload.
In the good case, your buffer is contained in a single MDL, so NdisGetDataBuffer just gives you a pointer back into that MDL's buffer.
In the slow case, your buffer straddles more than one MDL, so NdisGetDataBuffer carefully copies the relevant bit of payload into a scratch buffer that you provided.
The latter case can be fiddly, if you're trying to inspect more than a few bytes. If you're reading all 1500 bytes of the packet, you can't just allocate 1500 bytes on the stack (kernel stack space is scarce, unlike usermode), so you have to allocate it from the pool. Once you figure that out, note it will slow things down to copy all 1500 bytes of data into a scratch buffer for every packet. Is the slowdown too much? It depends on your needs. If you're only inspecting occasional packets, or if you're deploying the LWF on a low-throughput NIC, it won't matter. If you're trying to get beyond 1Gbps, you shouldn't be memcpying so much data around.
Also note that if you ultimately want to modify the packet, you'll need to be wary of NdisGetDataBuffer. It can give you a copy of the data (stored in your local scratch buffer), so if you modify the payload, those changes won't actually stick to the packet.
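The two cases, and the caveat about modifications, can be modelled outside the kernel. The following is a toy model in Go, not driver code: the mdl type and getDataBuffer function are invented for illustration and only mimic the documented contract of NdisGetDataBuffer (return a pointer into the MDL when the bytes are contiguous, otherwise copy into the caller's scratch storage), assuming the data starts at offset 0 of the chain.

```go
package main

import "fmt"

// mdl models one link of an MDL chain: a buffer plus a pointer to the next.
type mdl struct {
	data []byte
	next *mdl
}

// getDataBuffer returns the first need bytes of the chain. If they sit in
// one MDL it returns a slice aliasing that MDL (no copy); otherwise it
// copies them into scratch. It returns nil when the chain is too short
// or scratch is too small.
func getDataBuffer(first *mdl, need int, scratch []byte) []byte {
	if first != nil && len(first.data) >= need {
		return first.data[:need] // good case: pointer into the MDL
	}
	if len(scratch) < need {
		return nil
	}
	copied := 0
	for m := first; m != nil && copied < need; m = m.next {
		copied += copy(scratch[copied:need], m.data)
	}
	if copied < need {
		return nil
	}
	return scratch[:need] // slow case: a private copy
}

func main() {
	frame := []byte{0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff, 1, 2, 3, 4, 5, 6, 0x08, 0x00}
	scratch := make([]byte, 14)

	contiguous := &mdl{data: append([]byte(nil), frame...)}
	split := &mdl{data: append([]byte(nil), frame[:4]...)}
	split.next = &mdl{data: append([]byte(nil), frame[4:]...)}

	h1 := getDataBuffer(contiguous, 14, scratch)
	h1[0] = 0x00 // sticks: h1 aliases the MDL's buffer
	h2 := getDataBuffer(split, 14, scratch)
	h2[0] = 0x00 // does not stick: h2 is the scratch copy

	fmt.Printf("contiguous first byte: %02x\n", contiguous.data[0])
	fmt.Printf("split first byte:      %02x\n", split.data[0])
}
```

In the split case the write lands in the scratch buffer, so the frame itself is unchanged, which is exactly why modifying a packet through NdisGetDataBuffer is unreliable.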
What if you do need to scale to high throughputs, or modify the payload? Then you need to work out how to manipulate the MDL chain. That's a bit confusing at first, but spend a little time with the documentation and draw yourself some whiteboard diagrams.
I suggest first starting out by understanding an MDL. From networking's point of view, an MDL is just a fancy way of holding a { char * buffer, size_t length }, along with a link to the next MDL.
Next, consider the NB's DataOffset and DataLength. These conceptually move the buffer boundaries in from the beginning and the end of the buffer. They don't really care about MDL boundaries -- for example, you can reduce the length of the packet payload by decrementing DataLength, and if that means that one or more MDLs are no longer contributing any buffer space to the packet payload, it's no big deal, they're just ignored.
Finally, add on top CurrentMdl and CurrentMdlOffset. These are redundant with everything above, but they exist for (microbenchmark) performance. You aren't required to even think about them if you're reading the NB, but if you are editing the size of the NB, you do need to update them.

How to release memory allocated by a slice? [duplicate]

package main

import (
    "fmt"
    "time"
)

func main() {
    storage := []string{}
    for i := 0; i < 50000000; i++ {
        storage = append(storage, "string string string string string string string string string string string string")
    }
    fmt.Println("done allocating, emptying")
    storage = storage[:0]
    storage = nil
    for {
        time.Sleep(1 * time.Second)
    }
}
The code above will allocate about 30 MB of memory, and then won't release it. Why is that? How can I force Go to release the memory used by this slice? I sliced that slice and then nilled it.
The program I'm debugging is a simple HTTP input buffer: it appends all requests into large chunks, and sends these chunks over a channel to a goroutine for processing. But the problem is illustrated above: I can't get storage to release the memory, and I eventually run out of memory.
Edit: some people pointed to a similar question. No: first, that doesn't work, and second, it isn't what I'm asking for. The slice gets emptied, the memory does not.
There are several things going on here.
The first thing to absorb is that Go is
a garbage-collected language; the actual algorithm of its GC
is mostly irrelevant but one aspect of it is crucial to understand:
it does not use reference counting, and hence there's no way to
somehow make the GC immediately reclaim the memory of any given
value whose storage is allocated on the heap.
To recap it in more simple words, it's futile to do
s := make([]string, 10*100*100)
s = nil
as the second statement will indeed remove the sole reference
to the slice's underlying memory but won't make the GC go
and "mark" that memory as available for reuse.
This means two things:
You should know how the GC works.
This explains how it works
since v1.5 and up until now (v1.10 these days).
You should structure those of your algorithms which are
memory-intensive in a way that reduces memory pressure.
The latter can be done in several ways:
Preallocate, when you have a sensible idea about how much to.
In your example, you start with a slice of length 0,
and then append to it a lot. Now, almost all library code which deals
with growing memory buffers (the Go runtime included) deals with these
allocations by 1) allocating twice the memory requested, hoping to
prevent several future allocations, and 2) copying the "old" contents
over when it has to reallocate. This one is important: when reallocation
happens, there are two memory regions for a while: the old one and the new one.
If you can estimate that you may need to hold N elements on
average, preallocate for them using make([]T, 0, N).
If you'll need to hold less than N elements, the tail of that buffer
will be unused, and if you'll need to hold more than N, you'll need
to reallocate, but on average, you won't need any reallocations.
Re-use your slice(s). Say, in your case, you could "reset" the slice
by reslicing it to the zero length and then use it again for the next
request. This is called "pooling", and in the case of mass-parallel access
to such a pool, you could use sync.Pool to hold your buffers.
Limit the load on your system to make the GC be able to cope with
the sustained load. A good overview of the two approaches to such
limiting is this.
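The first two suggestions can be sketched as follows. The names countReallocations, chunkPool and collect are invented for this example, and collect is only a stand-in for whatever the real request handling does; sync.Pool itself is the standard library type mentioned above.

```go
package main

import (
	"fmt"
	"sync"
)

// countReallocations appends n elements to s and reports how many times
// the capacity changed, i.e. how often append moved to a new backing array.
func countReallocations(s []string, n int) int {
	reallocs := 0
	for i := 0; i < n; i++ {
		before := cap(s)
		s = append(s, "x")
		if cap(s) != before {
			reallocs++
		}
	}
	return reallocs
}

// chunkPool hands out reusable buffers. Storing a pointer to the slice
// avoids allocating a new interface value on every Put.
var chunkPool = sync.Pool{
	New: func() interface{} {
		s := make([]string, 0, 1024)
		return &s
	},
}

// collect borrows a buffer, fills it with the requests, "processes" it,
// and returns the emptied buffer to the pool.
func collect(requests []string) int {
	bp := chunkPool.Get().(*[]string)
	chunk := append((*bp)[:0], requests...) // reslice to zero length, keep capacity
	processed := len(chunk)                 // stand-in for real processing

	for i := range chunk {
		chunk[i] = "" // drop the string references so the GC can free them
	}
	*bp = chunk[:0]
	chunkPool.Put(bp)
	return processed
}

func main() {
	const n = 1000000
	fmt.Println("reallocations, starting empty:", countReallocations([]string{}, n))
	fmt.Println("reallocations, preallocated:  ", countReallocations(make([]string, 0, n), n))
	fmt.Println("processed:", collect([]string{"req1", "req2", "req3"}))
}
```

If the chunks are handed to another goroutine, as in the question, the buffer should only go back into the pool once that goroutine is done with it.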
In the program you wrote, it makes no sense to release memory because no part of code is requesting it any more.
To make a valid case, you have to request new memory and release it inside the loop. Then you will observe that the memory consumption stabilizes at some point.
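To watch what the collector does with a dropped slice, one option is to read the runtime's own statistics before and after a forced collection. This is a sketch for experimenting, not something to put into production code paths: runtime.GC and debug.FreeOSMemory are real standard library functions, while liveHeap and fillAndDrop are names made up here.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// storage is package-level so the slice is certainly heap-allocated.
var storage []string

// liveHeap returns the bytes of allocated heap objects as seen by the runtime.
func liveHeap() uint64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapAlloc
}

// fillAndDrop allocates n string headers, drops the only reference, forces
// a collection, and returns the heap size before and after the drop.
func fillAndDrop(n int) (before, after uint64) {
	storage = make([]string, n)
	for i := range storage {
		storage[i] = "string string string"
	}
	before = liveHeap()

	storage = nil
	runtime.GC()         // run a full collection now instead of waiting for one
	debug.FreeOSMemory() // additionally try to return freed pages to the OS
	after = liveHeap()
	return before, after
}

func main() {
	before, after := fillAndDrop(5000000)
	fmt.Printf("heap before: %d MB, after: %d MB\n", before>>20, after>>20)
}
```

Note that HeapAlloc describes the Go heap, not the resident set size reported by the operating system, which may lag behind.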

Multipart form uploads + memory leaks in golang?

The following server code:
package main

import (
    "fmt"
    "net/http"
)
func handler(w http.ResponseWriter, r *http.Request) {
    file, _, err := r.FormFile("file")
    if err != nil {
        fmt.Fprintln(w, err)
        return
    }
    defer file.Close()
}

func main() {
    http.ListenAndServe(":8081", http.HandlerFunc(handler))
}
being run and then calling it with:
curl -i -F "file=@./large-file" --form hello=world http://localhost:8081/
Where the large-file is about 80MB seems to have some form of memory leak in Go 1.4.2 on darwin/amd64 and linux/amd64.
When I hook up pprof, I see that bytes.makeSlice uses 96MB of memory after calling the service a few times (eventually called by r.FormFile in my code above).
If I keep calling curl, the growth in the process's memory usage slows over time, eventually seeming to stick around 300MB on my machine.
Thoughts? I assume this isn't expected and I'm doing something wrong?
If the memory usage stagnates at a "maximum", I wouldn't really call that a memory leak. I would rather say the GC is being lazy rather than eager, or simply doesn't want to physically free memory that is frequently reallocated / needed. If it really were a memory leak, used memory wouldn't stop at 300 MB.
r.FormFile("file") will result in a call to Request.ParseMultipartForm(), and 32 MB will be used as the value of the maxMemory parameter (the value of defaultMaxMemory defined in request.go). Since you upload a larger file (80 MB), a buffer of at least 32 MB will eventually be created (this is implemented in multipart.Reader.ReadForm()). Since bytes.Buffer is used to read the content, the reading process will start with a small or empty buffer, and reallocate whenever a bigger one is needed.
The strategy of buffer reallocations and the buffer sizes are implementation dependent (and also depend on the size of the chunks being read/decoded from the request), but just to have a rough picture, imagine it like this: 0 bytes, 4 KB, 16 KB, 64 KB, 256 KB, 1 MB, 4 MB, 16 MB, 64 MB. Again, this is just theoretical, but it illustrates that the sum can even grow beyond 100 MB just to read the first 32 MB of the file into memory, at which point it will be decided that it will be moved/stored in a file. See the implementation of multipart.Reader.ReadForm() for details. This reasonably explains the 96 MB allocation.
Do this a couple of times, and without the GC releasing the allocated buffers immediately, you can easily end up with 300 MB. And if there is enough free memory, there is no pressure on the GC to hurry with releasing memory. The reason you see it grow relatively big is that large buffers are used in the background. If you did the same with a 1 MB upload, you would probably not experience this.
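The cumulative cost of a growing bytes.Buffer can be measured instead of guessed. The sketch below sums the capacity of every backing array the buffer moves through while 32 MB are written in 4 KB pieces; growthCost is a hypothetical helper, the chunk size is an assumption, and the exact numbers depend on the Go version's growth strategy.

```go
package main

import (
	"bytes"
	"fmt"
)

// growthCost writes total bytes into a bytes.Buffer in chunk-sized pieces
// and sums the capacity of every backing array the buffer moved through.
func growthCost(total, chunk int) (allocated, allocations int) {
	var buf bytes.Buffer
	piece := make([]byte, chunk)
	lastCap := 0
	for written := 0; written < total; written += chunk {
		buf.Write(piece)
		if c := buf.Cap(); c != lastCap {
			allocated += c // a new, larger backing array was allocated
			allocations++
			lastCap = c
		}
	}
	return allocated, allocations
}

func main() {
	allocated, allocations := growthCost(32<<20, 4<<10)
	fmt.Printf("%d allocations, %.1f MB allocated in total to hold 32 MB\n",
		allocations, float64(allocated)/(1<<20))
}
```

Until the collector runs, all of the intermediate arrays can still be counted against the process, which is how a single upload can appear to cost far more than 32 MB.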
If it is important to you, you can also call Request.ParseMultipartForm() manually with a smaller maxMemory value, e.g.
r.ParseMultipartForm(2 << 20) // 2 MB
file, _, err := r.FormFile("file")
// ... rest of your handler
Doing so, much smaller (and fewer) buffers will be allocated in the background.

Go-lang parallel segment runs slower than series segment

I have built an epidemic mathematics model which is fairly computationally intense in Go. I'm trying now to build a set of systems to test my model, where I change an input and expect a different output. I built a version in series to slowly increase HIV prevalence and see effects on HIV deaths. It takes ~200 milliseconds to run.
for q = 0.0; q < 1000; q++ {
    inputs.CountryProfile.HivPrevalenceAdultsByGroup[0] = inputs.CountryProfile.HivPrevalenceAdultsByGroup[0] * float32(math.Pow(1.00001, q))
    results := costAnalysisHandler(inputs)
}
Then I made a "parallel" version using channels, and it takes longer, ~400 milliseconds to run. These small changes are important as we will be running millions of runs with different inputs, so would like to make it as efficient as possible. Here is the parallel version:
ch := make(chan ChData)
var q float64
for q = 0.0; q < 1000; q++ {
    go func(q float64, inputs *costanalysis.Inputs, ch chan ChData) {
        inputs.CountryProfile.HivPrevalenceAdultsByGroup[0] = inputs.CountryProfile.HivPrevalenceAdultsByGroup[0] * float32(math.Pow(1.00001, q))
        results := costAnalysisHandler(inputs)
        ch <- ChData{int(q), results.HivDeaths[20]}
    }(q, inputs, ch)
}
for q = 0.0; q < 1000; q++ {
    theResults := <-ch
}
Any thoughts are very much appreciated.
There's overhead to starting and communicating with background tasks. The time spent on your cost analyses probably dwarfs the cost of communication if the program was taking 200ms, but if coordination cost ever does kill your app, a common approach is to hand off largish chunks of work at a time, e.g., make each goroutine do analyses for a range of 10 q values instead of just one. (Edit: And as @Innominate says, making a "worker pool" of goroutines that process a queue of job objects is another common approach.)
Also, the code you pasted has a race condition. The contents of your Inputs struct don't get copied each time you spawn a goroutine, because you're passing your function a pointer. So goroutines running in parallel will read from and write to the same Inputs instance.
Simply making a brand new Inputs instance for each analysis, with its own arrays, etc. would avoid the race. If that ended up wasting tons of memory or causing lots of redundant copies, you could 1) recycle Inputs instances, 2) separate out read-only data that can safely be shared (maybe there's country data that's fixed, dunno), or 3) change some of the relatively big arrays to be local variables within costAnalysisHandler rather than stuff that needs to be passed around (maybe it could just take initial HIV prevalence and return HIV deaths at t=20, and everything else is local and on the stack).
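A minimal sketch of the "own copy per goroutine" idea. Inputs, clone, analyse and runAll are hypothetical stand-ins for costanalysis.Inputs and costAnalysisHandler, whose real definitions aren't shown in the question. The sketch also assumes each q should scale the base prevalence independently, whereas the serial loop compounds the factor from one iteration to the next.

```go
package main

import (
	"fmt"
	"math"
	"sync"
)

// Inputs is a hypothetical stand-in for costanalysis.Inputs.
type Inputs struct {
	HivPrevalenceAdultsByGroup []float32
}

// clone returns a deep copy, so a goroutine can modify it freely.
func (in *Inputs) clone() *Inputs {
	out := *in
	out.HivPrevalenceAdultsByGroup = append([]float32(nil), in.HivPrevalenceAdultsByGroup...)
	return &out
}

// analyse is a placeholder for costAnalysisHandler.
func analyse(in *Inputs) float32 {
	return in.HivPrevalenceAdultsByGroup[0] * 2
}

// runAll starts one goroutine per q; each works on its own copy of base.
func runAll(base *Inputs, n int) []float32 {
	results := make([]float32, n)
	var wg sync.WaitGroup
	for q := 0; q < n; q++ {
		wg.Add(1)
		go func(q int) {
			defer wg.Done()
			in := base.clone() // base is only read, never written
			in.HivPrevalenceAdultsByGroup[0] *= float32(math.Pow(1.00001, float64(q)))
			results[q] = analyse(in) // each goroutine writes its own index
		}(q)
	}
	wg.Wait()
	return results
}

func main() {
	base := &Inputs{HivPrevalenceAdultsByGroup: []float32{0.1}}
	results := runAll(base, 1000)
	fmt.Println(results[0], results[999], base.HivPrevalenceAdultsByGroup[0])
}
```

Running the program with the race detector (go run -race) is a quick way to check that no shared state is left.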
This doesn't apply to Go today, but did when the question was originally posted: nothing is really running in parallel unless you call runtime.GOMAXPROCS() with your desired concurrency level, e.g., runtime.GOMAXPROCS(runtime.NumCPU()).
Finally, you should only worry about all of this if you're doing some larger analysis and actually have a performance problem; if .2 seconds of waiting is all that performance work can save you here, it's not worth it.
Parallelizing a computationally intensive set of calculations requires that the parallel computations can actually run in parallel on your machine. If they don't then the extra overhead of creating goroutines, channels and reading off the channel will make the program run slower.
I'm guessing that is the problem here.
Try setting the GOMAXPROCS environment variable to the number of CPUs you have before running your code. Or call runtime.GOMAXPROCS(runtime.NumCPU()) before you start the parallel computations.
I see two issues related to parallel performance,
The first and more obvious one is that you must set GOMAXPROCS in order to get the Go runtime to use more than one cpu/core. Typically one would set it for the number of processors in the machine but the ideal setting can vary.
The second problem is a bit trickier, which is that your code doesn't appear to be parallelizing very well. Simply starting a thousand goroutines and assuming they'll work it out isn't going to give good results. You should probably be using some kind of worker pool, running a limited number of simultaneous computations (a good starting number would be the same as GOMAXPROCS) rather than trying to do 1000 at once.
See: http://golang.org/doc/faq#Why_no_multi_CPU
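One possible shape for such a worker pool, as a sketch only. The compute function is a placeholder for one model run, and runPool and result are names invented for this example; the number of workers is taken from runtime.NumCPU as suggested above.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// result pairs a job index with its computed value.
type result struct {
	q     int
	value float64
}

// compute is a placeholder for one expensive model run.
func compute(q int) float64 {
	return float64(q) * float64(q)
}

// runPool processes q = 0..n-1 on a fixed number of worker goroutines.
func runPool(n, workers int) []float64 {
	jobs := make(chan int, n)       // buffered so the producer never blocks
	results := make(chan result, n) // buffered so the workers never block

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for q := range jobs {
				results <- result{q: q, value: compute(q)}
			}
		}()
	}

	for q := 0; q < n; q++ {
		jobs <- q
	}
	close(jobs) // lets the workers' range loops finish
	wg.Wait()
	close(results)

	out := make([]float64, n)
	for r := range results {
		out[r.q] = r.value // results arrive in any order; index by q
	}
	return out
}

func main() {
	out := runPool(1000, runtime.NumCPU())
	fmt.Println(out[10], out[999])
}
```

With only as many goroutines as cores, the scheduling and channel overhead is paid per worker rather than per job.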

Why is file I/O in large chunks SLOWER than in small chunks?

If you call ReadFile once with something like 32 MB as the size, it takes noticeably longer than if you read the equivalent number of bytes with a smaller chunk size, like 32 KB.
(No, my disk is not busy.)
Edit 1:
Forgot to mention -- I'm doing this with FILE_FLAG_NO_BUFFERING!
Edit 2:
I don't have access to my old machine anymore (PATA), but when I tested it there, it took around 2 times as long, sometimes more. On my new machine (SATA), I'm only getting a ~25% difference.
Here's a piece of code to test:
#include <memory.h>
#include <stdlib.h>
#include <windows.h>
#include <tchar.h>
#include <stdio.h>
int main()
{
    /* The arguments after GENERIC_READ are reconstructed (the original line
       is cut off); FILE_FLAG_NO_BUFFERING follows from Edit 1 above. */
    HANDLE hFile = CreateFile(_T("\\\\.\\C:"), GENERIC_READ,
        FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, OPEN_EXISTING,
        FILE_FLAG_NO_BUFFERING, NULL);
    __try
    {
        const size_t chunkSize = 64 * 1024;
        const size_t bufferSize = 32 * 1024 * 1024;
        void *pBuffer = malloc(bufferSize);

        /* One large read of bufferSize bytes. */
        DWORD start = GetTickCount();
        ULONGLONG totalRead = 0;
        OVERLAPPED overlapped = { 0 };
        DWORD nr = 0;
        ReadFile(hFile, pBuffer, bufferSize, &nr, &overlapped);
        totalRead += nr;
        _tprintf(_T("Large read: %lu for %I64u bytes\n"),
            GetTickCount() - start, totalRead);

        /* The same range again, in chunkSize pieces. */
        totalRead = 0;
        start = GetTickCount();
        overlapped.Offset = 0;
        for (size_t j = 0; j < bufferSize / chunkSize; j++)
        {
            DWORD nr = 0;
            ReadFile(hFile, pBuffer, chunkSize, &nr, &overlapped);
            totalRead += nr;
            overlapped.Offset += chunkSize;
        }
        _tprintf(_T("Small reads: %lu for %I64u bytes\n"),
            GetTickCount() - start, totalRead);
    }
    __finally { CloseHandle(hFile); }
    return 0;
}
Large read: 1076 for 67108864 bytes
Small reads: 842 for 67108864 bytes
Any ideas?
Your test includes the time it takes to read in file metadata, specifically the mapping of file data to disk. If you close the file handle and re-open it, you should get similar timings for each. I tested this locally to make sure.
The effect is probably more severe with heavy fragmentation, as you have to read in more file-to-disk mappings.
EDIT: To be clear, I ran this change locally, and saw nearly identical times with large and small reads. Reusing the same file handle, I saw similar timings from the original question.
This is not specific to windows. I did some tests a while back with the C++ iostream library and found there was an optimum buffer size for reads, above which performance degraded. Unfortunately, I no longer have the tests, and I can't remember what the size was :-). As to why, well there are a lot of issues, such as a large buffer possibly causing paging in other applications running at the same time (as the buffer can't be paged).
When you perform the 1024 * 32KB reads, are you reading into the same memory block over and over, or are you allocating a total of 32MB to read into as well and filling the entire 32MB?
If you're reading the smaller reads into the same 32K block of memory, then the time difference is probably simply that Windows doesn't have to scavenge up the additional memory.
Update based on the FILE_FLAG_NO_BUFFERING addition to the question:
I'm not 100% certain, but I believe that when FILE_FLAG_NO_BUFFERING is used, Windows will lock the buffer into physical memory so it can allow the device driver to deal with physical addresses (such as to DMA directly into the buffer). It could (I believe) do this by breaking up a large request into smaller requests, but I suspect that Microsoft might have the philosophy that "if you ask for FILE_FLAG_NO_BUFFERING then we assume you know what you're doing and we're not going to get in your way".
Of course locking 32MB all at once instead of 32KB at a time will require more resources. So this would be kind of like my initial guess, but at the physical memory level rather than the virtual memory level.
However, since I don't work for MS and don't have access to Windows source, I'm going by vague recollection from times when I worked closer with the Windows kernel and device driver model (so this is more or less speculation).
When you use FILE_FLAG_NO_BUFFERING, the operating system does not buffer the I/O, so each call to the read function makes a system call that fetches the data from the disk. To read a file of a fixed size with a smaller buffer, more system calls are needed, which means more switches between user space and kernel space, and a disk I/O is initiated each time. With a larger block size, the same file needs fewer system calls, so there are fewer user-to-kernel switches and fewer disk I/Os. This is why a larger block generally takes less time to read.
Try reading the file only 1 byte at a time without buffering, then try with a 4096-byte block and see the difference.
A possible explanation in my opinion would be command queueing with FILE_FLAG_NO_BUFFERING, since this does direct DMA reads at low level.
A single large request will of course still necessarily be broken into sub-requests, but those will likely be sent more or less one after another (because the driver needs to lock the pages and will in all likelihood be reluctant to lock several megabytes lest it hits the quota).
On the other hand, if you throw a dozen or two dozen requests at the driver, it will just forward them to the disk, and the disk can take advantage of NCQ.
Well, that's what I'm thinking might be the reason anyway (this does not explain why the exact same phenomenon happens with buffered reads though, as in the Q that I linked to above).
What you are probably observing is that when using smaller blocks, the second block of data can be read while the first is being processed, then the third read while the second is being processed, etc., so that the speed limit is the slower of the physical read time or the processing time. If it takes the same amount of time to process one block as to read the next, the speed could be double what it would be if processing and reading were separate. When using larger blocks, the amount of data that is read while the first block is being processed will be limited to an amount smaller than the block size. When the code is ready for the next block of data, part of it will have been read but some of it will not; it will thus be necessary for the code to wait while the remainder of the data is fetched.
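The overlap described here can also be arranged explicitly. The sketch below, written in Go rather than against the Win32 API, reads chunks on one goroutine while the caller processes the previous chunk; pipeline and countBytes are names made up for the illustration, and the in-memory reader merely stands in for a file.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// pipeline reads r in chunkSize pieces on a separate goroutine while the
// caller processes the previous piece. process is called once per chunk,
// in order, on the calling goroutine.
func pipeline(r io.Reader, chunkSize int, process func([]byte)) error {
	chunks := make(chan []byte, 1) // one chunk may be read ahead
	errc := make(chan error, 1)

	go func() {
		defer close(chunks)
		for {
			buf := make([]byte, chunkSize)
			n, err := io.ReadFull(r, buf)
			if n > 0 {
				chunks <- buf[:n]
			}
			if err != nil {
				// EOF and a short final chunk are the normal way to finish.
				if err != io.EOF && err != io.ErrUnexpectedEOF {
					errc <- err
				}
				return
			}
		}
	}()

	for c := range chunks {
		process(c)
	}
	select {
	case err := <-errc:
		return err
	default:
		return nil
	}
}

// countBytes runs the pipeline over data and reports bytes and chunks seen.
func countBytes(data []byte, chunkSize int) (total, chunks int, err error) {
	err = pipeline(bytes.NewReader(data), chunkSize, func(c []byte) {
		total += len(c)
		chunks++
	})
	return total, chunks, err
}

func main() {
	total, chunks, err := countBytes(make([]byte, 32<<20), 64<<10)
	fmt.Println(total, chunks, err)
}
```

With a single large read there is nothing to overlap: processing cannot start until the whole block has arrived.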
