Mathlink memory usage accumulation

I use MathLink to send independent Mathematica expressions from a C++ application and receive the results, all as strings.
std::string expression[N];
// ...
for(int i = 0; i < N; ++i) {
    MLPutFunction(l, "EnterTextPacket", 1);
    MLPutString(l, expression[i].c_str());
    MLEndPacket(l);
    // Check Packet ...
    const char* result;
    MLGetString(l, &result);
    // process result ...
    MLDisownString(l, result);
}
I would expect MLDisownString to free the memory used, but it doesn't.
Any ideas?

OK, posting this as an answer, because I believe the odds that you are using version 5 or below are pretty low:
`As of Version 6.0, MLDisownString() has been superseded by MLReleaseString()`

First of all, I should point out the $HistoryLength parameter. Setting it to zero often reduces memory requirements considerably:
$HistoryLength = 0
At the same time, it is a known problem that the MathKernel process accumulates system memory during long computations and does not release it.
The only way to ultimately solve the problem is to restart the kernel when it takes too much memory or when the amount of free physical memory becomes too small. This task can be automated.
If you have not tried Mathematica 8 yet, it may be worth a try since, according to Oliver Ruebenkoenig:
"For version 8 the memory allocator has been rewritten and improved. (What a small sentence for such a huge endeavor and such a fine execution)"
However, I have not tried version 8 yet and cannot say anything about it.

Related

Is there a way to "unfetch" a cache line?

Let's say I'm looping through 10 different 4kb arrays of ints, incrementing them:
int* buffers[10] = ...; // 10 4kb buffers, not next to each other
for (int i = 0; i < 10; i++) {
    for (int j = 0; j < 512; j++) {
        buffers[i][j]++;
    }
}
The compiler/CPU are pretty cool, and can do some cache prefetching for the inner loop. That's awesome. But...
...I've just eaten up to 40kb of cache, and kicked out data which the rest of my program enjoyed having in the cache.
It would be cool if I could hint to the compiler or CPU that "I'm not touching this memory again in the foreseeable future, so you can reuse these cache lines":
int* buffers[10] = ...;
for (int i = 0; i < 10; i++) {
    for (int j = 0; j < 512; j++) {
        buffers[i][j]++;
    }
    // Unfetch entire 4kb buffer
    cpu_cache_unfetch(buffers[i], 4096);
}
cpu_cache_unfetch would conceptually "doom" any cache lines in that range, throwing them away first.
In the end, this will mean that my little snippet of code uses 4kb of cache, instead of 40kb. It would reuse the 4kb of cache 10 times. The rest of the program would appreciate that very much.
Would this even make sense? If so, is there a way to do this?
Also appreciated: let me know all the ways I've shown myself to fundamentally misunderstand caching! =D

I only know the answer for x86. This is definitely architecture-specific; different ISAs have different cache-control features.
On x86, yes, clflush / clflushopt, but they only evict one single cache line per execution. (They force write-back + eviction, like you'd need for memory-mapped non-volatile storage). My understanding is that clflushopt is not usually worth it for a case like this, vs. just allowing cache pollution to happen.
In theory there are possible speedups from using NT prefetch for read-only, but that's brittle (tuning the software-prefetch depends on the HW, and getting it wrong can hurt a lot). Doing a regular store would probably undo the effects of an NT prefetch and leave the line in the most-recently-used position in L1, L2, and L3.
One possibly-crazy approach would be NT stores. Load a whole cache-line of data (four 16-byte vectors = 64 bytes), then store the updated values with movntdq.
NT means "non-temporal"; for use when data will not be referenced again in the near future (even by another core). What is the meaning of "non temporal" memory accesses in x86 has some pretty generic answers, but may help.
According to Intel's manual, NT stores evict the destination cache line if it was previously cached (What happens with a non-temporal store if the data is already in cache?), so it would work for your use-case. But the compiler would have to be sure to reach a 64-byte alignment boundary in the inner loop so it can read one or two whole cache lines, instead of reading 32 bytes of one and 32 bytes of another, and evicting it with an NT store before reading the last 32 bytes of a line. (Pointer math is easy in asm, though; Compilers do know how to go scalar until an alignment boundary.)
The normal use-case for NT stores is for write-only destination buffers to avoid the MESI RFO overhead, but this use-case is at least possibly a win.
See discussion in comments chat: this might perform significantly worse. Definitely benchmark both ways before doing this, preferably on a variety of hardware including multi-socket systems.
It's also almost definitely worse if the array was hot in cache to start with. I was assuming that this was the only thing to touch it, rather than the last in a chain of modifications.
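As a rough illustration of the NT-store idea above, here is a minimal sketch using SSE2 intrinsics. It assumes an x86 target, a buffer aligned to 64 bytes, and an element count that is a multiple of 16 ints (one cache line). The helper name increment_nt is made up for this example, and nothing here shows that the approach is faster.

```cpp
#include <cstddef>
#include <emmintrin.h>  // SSE2: _mm_load_si128, _mm_add_epi32, _mm_stream_si128

// Hypothetical helper: adds 1 to every int in buf and writes the results
// back with non-temporal stores.
// Assumes buf is 64-byte aligned and count is a multiple of 16.
void increment_nt(int* buf, std::size_t count) {
    const __m128i one = _mm_set1_epi32(1);
    for (std::size_t i = 0; i < count; i += 16) {
        __m128i* line = reinterpret_cast<__m128i*>(buf + i);
        // Load the whole 64-byte line first, so that no NT store can evict
        // the line before the rest of it has been read.
        const __m128i a = _mm_load_si128(line + 0);
        const __m128i b = _mm_load_si128(line + 1);
        const __m128i c = _mm_load_si128(line + 2);
        const __m128i d = _mm_load_si128(line + 3);
        _mm_stream_si128(line + 0, _mm_add_epi32(a, one));
        _mm_stream_si128(line + 1, _mm_add_epi32(b, one));
        _mm_stream_si128(line + 2, _mm_add_epi32(c, one));
        _mm_stream_si128(line + 3, _mm_add_epi32(d, one));
    }
    _mm_sfence();  // order the NT stores before any later stores
}
```

It would be called once per buffer, for example increment_nt(buffers[i], n) with n the number of ints in that buffer. Benchmark it against the plain loop before keeping it.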

Does mmap allocate heap memory contiguously?

Provided that:
The size I request is a multiple of the page size
The start address I request is the size + start address of the last allocation
If I always follow these rules when using mmap to allocate memory on the heap, will the addresses returned be contiguous? Or could there be gaps between them?

You can get the behavior you want with the MAP_FIXED flag. Unfortunately for your goal, it's not universally supported, so you'd want to check the return value to ensure that it gave you the allocation you requested. For good portability, you'd need a backup plan for when the call fails, which mmap reports by returning MAP_FAILED rather than 0.

Quick Answer: Not necessarily. There's a good chance it will "almost always work" in both limited and extensive testing on a variety of machines, but it's definitely not good practice. The MAP_FIXED flag is supported on most flavors of Linux, but it is also buggy in my experience. Avoid.
Better in your case is to simply allocate everything you need at once, and then assign pointers manually to each sub-section of the mapping:
int LengthOf_FirstThing = 0x18000;
int LengthOf_SecondThing = 0x10100;
int LengthOf_ThirdThing = 0x20000;
int _pagesize = getpagesize();
int _pagemask = _pagesize - 1;
size_t sizeOfEverything = LengthOf_FirstThing + LengthOf_SecondThing + LengthOf_ThirdThing;
sizeOfEverything = (sizeOfEverything + _pagemask) & ~(_pagemask);
int8_t* result = (int8_t*)mmap(nullptr, sizeOfEverything, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
int8_t* myFirstThing = result;
int8_t* mySecondThing = myFirstThing + LengthOf_FirstThing;
int8_t* myThirdThing = mySecondThing + LengthOf_SecondThing;
Another advantage of this approach is that the things you're mapping don't each have to be strictly aligned to the page size. Most importantly, it guarantees fully contiguous memory.
Longer answer:
Implementations of mmap() can freely disregard the 'hint' address entirely, so you should never expect the address to be honored. This may be more common than expected, because some implementations may not actually support page-size granularity for new mappings. They may limit valid starting addresses to 16k or 64k boundaries to help reduce the overhead needed to manage very large virtual address spaces. Such an implementation would always disregard an mmap() hint that isn't aligned to such a boundary.
Additionally, mmap() does not allocate memory from the heap at all. The heap is an area of memory created/reserved by the C runtime libraries (glibc on *nix) when a process is created. malloc() and new/delete are typically the only functions that pull from the heap, along with any libraries that may use malloc/new internally. The heap itself is typically created and managed by calls to mmap() internally.

I think this is not specified but a so called "implementation detail". I.e. you should not rely on one behaviour or the other, but assume that the pointer is opaque and not be concerned with its exact value.
(That said, there can be a place and time for hacks. In that case you need to find out exactly how your OS behaves.)

Why is this simple OpenCL kernel running so slowly?

I'm looking into OpenCL, and I'm a little confused why this kernel is running so slowly, compared to how I would expect it to run. Here's the kernel:
__kernel void copy(
    const __global char* pSrc,
    __global __write_only char* pDst,
    int length)
{
    const int tid = get_global_id(0);
    if(tid < length) {
        pDst[tid] = pSrc[tid];
    }
}
I've created the buffers in the following way:
char* out = new char[2048*2048];
cl::Buffer(
context,
CL_MEM_USE_HOST_PTR | CL_MEM_WRITE_ONLY,
length,
out);
Ditto for the input buffer, except that I've initialized the in pointer to random values. Finally, I run the kernel this way:
cl::Event event;
queue.enqueueNDRangeKernel(
kernel,
cl::NullRange,
cl::NDRange(length),
cl::NDRange(1),
NULL,
&event);
event.wait();
On average, the time is around 75 milliseconds, as calculated by:
cl_ulong startTime = event.getProfilingInfo<CL_PROFILING_COMMAND_START>();
cl_ulong endTime = event.getProfilingInfo<CL_PROFILING_COMMAND_END>();
std::cout << (endTime - startTime) * SECONDS_PER_NANO / SECONDS_PER_MILLI << "\n";
I'm running Windows 7, with an Intel i5-3450 chip (Ivy Bridge architecture). For comparison, the "direct" way of doing the copy takes less than 5 milliseconds. I don't think the event.getProfilingInfo includes the communication time between the host and device. Thoughts?
EDIT:
At the suggestion of ananthonline, I changed the kernel to use float4s instead of chars, and that dropped the average run time to about 50 millis. Still not as fast as I would have hoped, but an improvement. Thanks ananthonline!

I think your main problem is the 2048*2048 work groups you are using. The OpenCL drivers on your system have to manage a lot more overhead if you have this many single-item work groups. This would be especially bad if you were to execute this program on a GPU, because you would get a very low level of saturation of the hardware.
Optimization: call your kernel with larger work groups. You don't even have to change your existing kernel. See the question "What should this size be?". I have used 64 below as an example; 64 happens to be a decent number on most hardware.
const size_t myOptimalGroupSize = 64; // plain size_t; cl::size_t is a class template in the C++ wrapper
cl::Event event;
queue.enqueueNDRangeKernel(
    kernel,
    cl::NullRange,
    cl::NDRange(length),
    cl::NDRange(myOptimalGroupSize),
    NULL,
    &event);
event.wait();
You should also get your kernel to do more than copy a single value. I have given an answer to a similar question about global memory over here.

CPUs are very different from GPUs. Running this on an x86 CPU, the best way to achieve decent performance would be to use double16 (the largest data type) instead of char or float4 (as suggested by someone else).
In my limited experience with OpenCL on CPU, I have never reached the performance levels that I could get with an OpenMP parallelization.
The best way to do a copy in parallel with a CPU would be to divide the block to copy into a small number of large sub-blocks, and let each thread copy one sub-block.
The GPU approach is orthogonal: each thread participates in the copy of the same block.
This is because on GPUs, different threads can access contiguous memory regions efficiently (coalescing).
To do an efficient copy on CPU with OpenCL, use a loop inside your kernel to copy contiguous data. And then use a workgroup size not larger than the number of available cores.
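The CPU-side strategy described above (a few large contiguous sub-blocks, one per thread) can be sketched with plain C++ threads. This is not OpenCL or OpenMP code; it only illustrates the partitioning. The function name parallel_copy is made up for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical helper: copies n bytes from src to dst using up to
// `workers` threads, each copying one contiguous sub-block.
void parallel_copy(char* dst, const char* src, std::size_t n, unsigned workers) {
    if (workers == 0) workers = 1;
    const std::size_t chunk = (n + workers - 1) / workers;  // ceil(n / workers)
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t begin = std::min(n, static_cast<std::size_t>(w) * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) break;  // nothing left to hand out
        pool.emplace_back([=] {
            std::memcpy(dst + begin, src + begin, end - begin);
        });
    }
    for (std::thread& t : pool) t.join();
}
```

For the buffers in the question this might be called as parallel_copy(out, in, 2048 * 2048, std::thread::hardware_concurrency()). Whether it beats a single memcpy depends on memory bandwidth, so it is worth measuring.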

I believe it is the cl::NDRange(1) which is telling the runtime to use single-item work groups. This is not efficient. In the C API you can pass NULL for this to leave the work group size up to the runtime; in the C++ API the equivalent is to pass cl::NullRange as the local size. This should be faster on the CPU; it certainly will be on a GPU.

Why is file I/O in large chunks SLOWER than in small chunks?

If you call ReadFile once with something like 32 MB as the size, it takes noticeably longer than if you read the equivalent number of bytes with a smaller chunk size, like 32 KB.
Why?
(No, my disk is not busy.)
Edit 1:
Forgot to mention -- I'm doing this with FILE_FLAG_NO_BUFFERING!
Edit 2:
Weird...
I don't have access to my old machine anymore (PATA), but when I tested it there, it took around 2 times as long, sometimes more. On my new machine (SATA), I'm only getting a ~25% difference.
Here's a piece of code to test:
#include <memory.h>
#include <windows.h>
#include <tchar.h>
#include <stdio.h>
int main()
{
    HANDLE hFile = CreateFile(_T("\\\\.\\C:"), GENERIC_READ,
        FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
        OPEN_EXISTING, FILE_FLAG_NO_BUFFERING /*(redundant)*/, NULL);
    __try
    {
        const size_t chunkSize = 64 * 1024;
        const size_t bufferSize = 32 * 1024 * 1024;
        void *pBuffer = malloc(bufferSize);
        DWORD start = GetTickCount();
        ULONGLONG totalRead = 0;
        OVERLAPPED overlapped = { 0 };
        DWORD nr = 0;
        // One large read of the whole buffer
        ReadFile(hFile, pBuffer, bufferSize, &nr, &overlapped);
        totalRead += nr;
        // %lu matches DWORD, %llu matches ULONGLONG
        _tprintf(_T("Large read: %lu for %llu bytes\n"),
            GetTickCount() - start, totalRead);
        totalRead = 0;
        start = GetTickCount();
        overlapped.Offset = 0;
        // The same amount of data, read in chunkSize pieces
        for (size_t j = 0; j < bufferSize / chunkSize; j++)
        {
            DWORD nr = 0;
            ReadFile(hFile, pBuffer, chunkSize, &nr, &overlapped);
            totalRead += nr;
            overlapped.Offset += chunkSize;
        }
        _tprintf(_T("Small reads: %lu for %llu bytes\n"),
            GetTickCount() - start, totalRead);
        fflush(stdout);
    }
    __finally { CloseHandle(hFile); }
    return 0;
}
Result:
Large read: 1076 for 67108864 bytes
Small reads: 842 for 67108864 bytes
Any ideas?

Your test is including the time it takes to read in file metadata, specifically the mapping of file data to disk. If you close the file handle and re-open it, you should get similar timings for each. I tested this locally to make sure.
The effect is probably more severe with heavy fragmentation, as you have to read in more file-to-disk mappings.
EDIT: To be clear, I ran this change locally, and saw nearly identical times with large and small reads. Reusing the same file handle, I saw similar timings from the original question.

This is not specific to Windows. I did some tests a while back with the C++ iostream library and found there was an optimum buffer size for reads, above which performance degraded. Unfortunately, I no longer have the tests, and I can't remember what the size was :-). As to why, there are a lot of possible issues, such as a large buffer causing paging in other applications running at the same time (as the buffer can't be paged).
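For anyone who wants to repeat that kind of experiment portably, here is a minimal sketch of a chunked reader built on std::ifstream. The function name read_in_chunks is hypothetical. Unlike the FILE_FLAG_NO_BUFFERING test in the question, this goes through the stream buffer and the OS file cache, so it measures something different.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: reads the whole file in pieces of `chunk` bytes,
// reusing one buffer, and returns the total number of bytes read.
std::size_t read_in_chunks(const std::string& path, std::size_t chunk) {
    if (chunk == 0) return 0;  // a zero-sized read would never reach EOF
    std::ifstream in(path, std::ios::binary);
    std::vector<char> buffer(chunk);
    std::size_t total = 0;
    while (in) {
        in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        total += static_cast<std::size_t>(in.gcount());  // counts a short final read too
    }
    return total;
}
```

Timing several calls with different chunk sizes, for instance with std::chrono::steady_clock around each call, should show whether an optimum exists on a given machine.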

When you perform the 1024 * 32KB reads, are you reading into the same memory block over and over, or are you also allocating a total of 32MB to read into and filling the entire 32MB?
If you're reading the smaller reads into the same 32K block of memory, then the time difference is probably simply that Windows doesn't have to scavenge up the additional memory.
Update based on the FILE_FLAG_NO_BUFFERING addition to the question:
I'm not 100% certain, but I believe that when FILE_FLAG_NO_BUFFERING is used, Windows will lock the buffer into physical memory so it can allow the device driver to deal with physical addresses (such as to DMA directly into the buffer). It could (I believe) do this by breaking up a large request into smaller requests, but I suspect that Microsoft might have the philosophy that "if you ask for FILE_FLAG_NO_BUFFERING then we assume you know what you're doing and we're not going to get in your way".
Of course locking 32MB all at once instead of 32KB at a time will require more resources. So this would be kind of like my initial guess, but at the physical memory level rather than the virtual memory level.
However, since I don't work for MS and don't have access to Windows source, I'm going by vague recollection from times when I worked closer with the Windows kernel and device driver model (so this is more or less speculation).

When you specify FILE_FLAG_NO_BUFFERING, the operating system does not buffer the I/O, so every call to the read function makes a system call that fetches the data from the disk. To read a file of a fixed size with a smaller buffer, more system calls are needed, which means more transitions from user space to kernel space, and each one initiates a disk I/O. With a larger block size, the same file takes fewer system calls, so there are fewer user-to-kernel transitions and fewer disk I/O requests. This is why a larger block generally takes less time to read.
Try reading the file 1 byte at a time without buffering, then try again with 4096-byte blocks, and see the difference.

A possible explanation in my opinion would be command queueing with FILE_FLAG_NO_BUFFERING, since this does direct DMA reads at low level.
A single large request will of course still necessarily be broken into sub-requests, but those will likely be sent more or less one after another (because the driver needs to lock the pages and will in all likelihood be reluctant to lock several megabytes lest it hits the quota).
On the other hand, if you throw a dozen or two dozen requests at the driver, it will just forward them to the disk, and the disk can take advantage of NCQ.
Well, that's what I'm thinking might be the reason anyway (this does not explain why the exact same phenomenon happens with buffered reads though, as in the Q that I linked to above).

What you are probably observing is that when using smaller blocks, the second block of data can be read while the first is being processed, then the third read while the second is being processed, and so on, so that the speed limit is the slower of the physical read time or the processing time. If it takes the same amount of time to process one block as to read the next, the speed could be double what it would be if processing and reading were separate. When using larger blocks, the amount of data that is read while the first block is being processed will be limited to an amount smaller than the block size. When the code is ready for the next block of data, part of it will have been read but some of it will not; the code will thus have to wait while the remainder of the data is fetched.

Which (OS X) dtrace probe fires when a page is faulted in from disk?

I'm writing up a document about page faulting and am trying to get some concrete numbers to work with, so I wrote up a simple program that reads 12*1024*1024 bytes of data. Easy:
int main()
{
    FILE*in = fopen("data.bin", "rb");
    int i;
    int total=0;
    for(i=0; i<1024*1024*12; i++)
        total += fgetc(in);
    printf("%d\n", total);
}
So yes, it goes through and reads the entire file. The issue is that I need the dtrace probe that is going to fire 1536 times during this process (12M/8k). Even if I count all of the fbt:mach_kernel:vm_fault*: probes and all of the vminfo::: probes, I don't hit 500, so I know I'm not finding the right probes.
Anyone know where I can find the dtrace probes that fire when a page is faulted in from disk?
UPDATE:
On the off chance that the issue was that there was some intelligent pre-fetching going on in the stdio functions, I tried the following:
int main()
{
    int in = open("data.bin", O_RDONLY | O_NONBLOCK);
    int i;
    int total=0;
    char buf[128];
    for(i=0; i<1024*1024*12; i++)
    {
        read(in, buf, 1);
        total += buf[0];
    }
    printf("%d\n", total);
}
This version takes MUCH longer to run (42s real time, 10s of which was user and the rest was system time - page faults, I'm guessing) but still generates one fifth as many faults as I would expect.
For the curious, the time increase is not due to loop overhead and casting (char to int.) The code version that does just these actions takes .07 seconds.

Not a direct answer, but it seems you are equating disk reads and page faults. They are not necessarily the same. In your code you are reading data from a file into a small user memory chunk, so the I/O system can read the file into the buffer/VM cache in any way and size it sees fit. I might be wrong here, I don't know how Darwin does this.
I think a more reliable test would be to mmap(2) the whole file into process memory and then touch each page in that space.

I was down the same rathole recently. I don't have my DTrace scripts or test programs available just now, but I will give you the following advice:
1.) Get your hands on OS X Internals by Amit Singh and read section 8.3 on virtual memory (this will get you in the right frame of reference for selecting DTrace probes).
2.) Get your hands on Solaris Performance and Tools by Brendan Gregg / Jim Mauro. Read the section on virtual memory and pay close attention to the example DTrace scripts that make use of the vminfo provider.
3.) OS X is definitely prefetching large chunks of pages from the filesystem, and your test program is playing right into this optimization (since you're reading sequentially). Interestingly, this is not the case for Solaris. Try randomly accessing the big array to defeat the prefetch.

The assumption that the operating system will fault in each and every page that's being touched as a separate operation (and that therefore, if you touch N pages, you'll see the DTrace probe fire N times) is flawed; most UN*Xes will perform some sort of readahead or pre-faulting, and you're very unlikely to get exactly the same number of calls as you have pages. This is so even if you use mmap() directly.
The exact ratio may also depend on the filesystem, as readahead and page clustering implementations and thresholds are unlikely to be the same for all of them.
You probably can force a per-page fault policy if you use mmap directly and then apply madvise(MADV_DONTNEED) or similar and/or purge the entire range with msync(MS_INVALIDATE).
