Why do we need external sort?

The main reason for external sort is that the data may be larger than the main memory we have. However, we use virtual memory these days, and virtual memory takes care of swapping between main memory and disk. Why do we still need external sort?

An external sort algorithm makes sorting large amounts of data efficient (even when the data does not fit into physical RAM).
While using an in-memory sorting algorithm plus virtual memory satisfies the functional requirements of an external sort (that is, it will sort the data), it fails to achieve the non-functional requirement of being efficient. A good external sort minimises the amount of data read from and written to external storage (and historically also seek times), and an in-memory sort algorithm running on top of a general-purpose virtual memory implementation, which was never designed for this access pattern, will not be competitive with an algorithm designed to minimise IO.

In addition to @Anonymous's answer that external sort is better optimized for less disk IO, sometimes using an in-memory sort with virtual memory is simply infeasible, because the virtual address space is smaller than the file's size.
For example, on a 32-bit system (there are still a lot of these), if you want to sort a 20 GB file: a 32-bit system gives you 2^32 ≈ 4 GB of virtual addresses, so the file you are trying to sort cannot fit into the address space at all.
This used to be a real issue when 64-bit systems were not yet common, and it is still an issue today for old 32-bit systems and some embedded devices.
However, even on a 64-bit system, as explained in the previous answer, an external sort algorithm is optimized for the nature of sorting and will require significantly less disk IO than letting the OS "take care of things".
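To make the contrast concrete, here is a minimal sketch (not taken from any of the answers above) of a two-phase external merge sort for a file of ints: sort chunks that fit in RAM into temporary runs, then merge the runs with sequential reads. The run length, the use of tmpfile() for the runs, and the plain linear minimum scan in the merge are simplifications chosen for the example; error handling is omitted.

#include <stdio.h>
#include <stdlib.h>

#define RUN_LEN (1 << 20)   /* ints per in-memory run; assumed to fit in RAM */

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Phase 1: read chunks that fit in memory, sort each with qsort, and
   write every chunk out as a sorted temporary "run". */
static int make_runs(FILE *in, FILE *runs[], int max_runs)
{
    int *buf = malloc(RUN_LEN * sizeof *buf);
    int nruns = 0;
    size_t n;
    while (nruns < max_runs && (n = fread(buf, sizeof *buf, RUN_LEN, in)) > 0) {
        qsort(buf, n, sizeof *buf, cmp_int);
        runs[nruns] = tmpfile();
        fwrite(buf, sizeof *buf, n, runs[nruns]);
        rewind(runs[nruns]);
        nruns++;
    }
    free(buf);
    return nruns;
}

/* Phase 2: k-way merge of the sorted runs. Every run is read strictly
   sequentially, so the total IO is one read and one write per element,
   instead of the random page traffic a swapped-out in-memory sort causes. */
static void merge_runs(FILE *runs[], int nruns, FILE *out)
{
    int *head = malloc(nruns * sizeof *head);
    int *alive = malloc(nruns * sizeof *alive);
    for (int i = 0; i < nruns; i++)
        alive[i] = fread(&head[i], sizeof head[i], 1, runs[i]) == 1;
    for (;;) {
        int best = -1;
        for (int i = 0; i < nruns; i++)
            if (alive[i] && (best < 0 || head[i] < head[best]))
                best = i;
        if (best < 0)
            break;
        fwrite(&head[best], sizeof head[best], 1, out);
        alive[best] = fread(&head[best], sizeof head[best], 1, runs[best]) == 1;
    }
    free(head);
    free(alive);
}

A real external sort would use a min-heap for the merge, large buffered reads and writes per run, and extra merge passes when there are more runs than open files, but the IO pattern - sequential in, sequential out - is the point.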

I'm using Windows; in a command-line shell you can run "systeminfo", which gives my laptop's memory usage information:
Total Physical Memory: 8,082 MB
Available Physical Memory: 2,536 MB
Virtual Memory: Max Size: 11,410 MB
Virtual Memory: Available: 2,686 MB
Virtual Memory: In Use: 8,724 MB
I wrote a small app to test the maximum size of array I could initialize on my laptop.
public static void BurnMemory()
{
    for (var i = 1; i <= 62; i++)   // 1L << i would overflow past 62
    {
        long size = 1L << i;
        long gb = 4 * size / (1L << 30);
        try
        {
            // 1 int32 takes 32 bits (4 bytes) of memory
            var arr = new int[size];
            Console.WriteLine("Test passed: initialized an array with size = 2^" + i.ToString());
        }
        catch (OutOfMemoryException)
        {
            Console.WriteLine("Reached memory limitation when initializing an array with size = 2^{0} int32 = 4 x {1} B = {2} GB", i, size, gb);
            break;
        }
    }
}
It terminates when trying to initialize an array of size 2^29:
Reached memory limitation when initializing an array with size = 2^29 int32 = 4 x 536870912 B = 2 GB
What I get from the test:
It is not hard to reach the memory limitation.
We need to understand our server's capability, and then decide whether to use an in-memory sort or an external sort.

Related

How do I correctly adjust the Argon2 parameters in Go to consume less memory?

Argon2 by design is memory hungry. In the semi-official Go implementation, the following parameters are recommended when using IDKey:
key := argon2.IDKey([]byte("some password"), salt, 1, 64*1024, 4, 32)
where 1 is the time parameter and 64*1024 is the memory parameter. This means the library will create a 64MB buffer when hashing a value. In scenarios where many hashing procedures might run at the same time this creates high pressure on the host memory.
In cases where this is too much memory consumption it is advised to decrease the memory parameter and increase the time factor:
The draft RFC recommends[2] time=1, and memory=64*1024 is a sensible number. If using that amount of memory (64 MB) is not possible in some contexts then the time parameter can be increased to compensate.
So, assuming I would like to limit memory consumption to 16MB (1/4 of the recommended 64MB), it is still unclear to me how I should be adjusting the time parameter: is this supposed to be times 4 so that the product of memory and time stays the same? Or is there some other logic behind the correlation of time and memory at play?
The draft RFC recommends[2] time=1, and memory=64*1024 is a sensible number. If using that amount of memory (64 MB) is not possible in some contexts then the time parameter can be increased to compensate.
I think the key here is the word "to compensate", so in this context it is trying to say: to achieve similar hashing complexity as IDKey([]byte("some password"), salt, 1, 64*1024, 4, 32), you can try IDKey([]byte("some password"), salt, 4, 16*1024, 4, 32).
But if you want to decrease the hashing complexity (and reduce the performance overhead), you can decrease the memory parameter without touching the time parameter.
is this supposed to be times 4 so that the product of memory and time stays the same?
I don't think so. The memory parameter controls how large the internal memory buffer is (in KiB), while the time parameter means how many passes over that memory are made before the end result is produced.
So these two parameters are independent knobs; together they control how much resistance against "brute-force cost savings due to time-memory tradeoffs" you want to achieve.
Difficulty is roughly equal to time_cost * memory_cost (and possibly / parallelism). So if you 0.25x the memory cost, you should 4x the time cost. See also this answer.
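Concretely, with the numbers from the question: dropping memory from 64*1024 KiB (64 MB) to 16*1024 KiB (16 MB) quarters the memory cost, so raising time from 1 to 4 keeps time_cost * memory_cost roughly constant - which is exactly the IDKey(..., 4, 16*1024, 4, 32) call suggested in the earlier answer.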
// The time parameter specifies the number of passes over the memory and the
// memory parameter specifies the size of the memory in KiB.
Check out the Argon2 API itself. I'm going to cross-reference a little bit and use the argon2-cffi documentation. It looks like the Go interface uses the C FFI (foreign function interface) under the hood, so the prototype should be the same.
Parameters
time_cost (int) – Defines the amount of computation realized and therefore the execution time, given in number of iterations.
memory_cost (int) – Defines the memory usage, given in kibibytes.
parallelism (int) – Defines the number of parallel threads (changes the resulting hash value).
hash_len (int) – Length of the hash in bytes.
salt_len (int) – Length of random salt to be generated for each password.
encoding (str) – The Argon2 C library expects bytes. So if hash() or verify() are passed an unicode string, it will be encoded using this encoding.
type (Type) – Argon2 type to use. Only change for interoperability with legacy systems.
Indeed, if we look at the Go docs:
// The draft RFC recommends[2] time=1, and memory=64*1024 is a sensible number.
// If using that amount of memory (64 MB) is not possible in some contexts then
// the time parameter can be increased to compensate.
//
// The time parameter specifies the number of passes over the memory and the
// memory parameter specifies the size of the memory in KiB. For example
// memory=64*1024 sets the memory cost to ~64 MB. The number of threads can be
// adjusted to the numbers of available CPUs. The cost parameters should be
// increased as memory latency and CPU parallelism increases. Remember to get a
// good random salt.
I'm not 100% clear on the impact of thread count, but I believe it does parallelize the hashing, and like any multithreaded job, this cuts the total time taken to roughly 1/N for N cores. Apparently, you should essentially set the parallelism to the CPU count.

A heap manager for C/Pascal that automatically fills freed memory with zero bytes

What do you think about an option to fill freed (no longer actually used) pages with zero bytes? This may improve performance under Windows, and also under VMWare and other virtual machine environments. For example, VMWare and Hyper-V calculate hashes of memory pages and, if the contents are the same, mark such a page as "shared" inside a virtual machine and between virtual machines on the same host, until the page is modified. This effectively decreases memory consumption. Windows does something similar - it handles zero pages differently, treating them as free.
We could have a heap manager that would automatically fill memory with zeros when we call FreeMem/ReallocMem. As an alternative, we could have a function that zeroizes free memory on demand, i.e. only when this function is explicitly called. Of course, this function would have to be thread-safe.
The drawback of filling memory with zeros is touching memory that might already have been paged out, thus causing page faults. Besides that, memory store operations are not free, so our program will be slower, albeit to an unknown (maybe negligible) extent.
If we manage to fill a 4K page completely with zeros, the hypervisor or Windows can explicitly mark it as a zero page. But even partial zeroizing may be beneficial, since the hypervisor may compress pages using LZ or similar algorithms to save physical memory.
I just want to know your opinion whether the benefits of filling emptied heap memory with zero bytes by the heap manager itself will outweigh the disadvantages of such a technique.
Is zeroizing worth its price when we buy reduced physical memory consumption?
When you have a page whose contents you no longer care about but you still want to keep it allocated, you can call VirtualAlloc (and variants) and pass the MEM_RESET flag.
From VirtualAlloc on MSDN:
MEM_RESET
Indicates that data in the memory range specified by lpAddress and
dwSize is no longer of interest. The pages should not be read from or
written to the paging file. However, the memory block will be used
again later, so it should not be decommitted. This value cannot be
used with any other value.
Using this value does not guarantee that
the range operated on with MEM_RESET will contain zeros. If you want
the range to contain zeros, decommit the memory and then recommit it.
This gives the best of both worlds - you don't have the cost of zeroing the memory, and the system does not have the cost of paging it back in. You get to take advantage of the well-tuned memory manager which already has a zero-pool.
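A minimal sketch of that pattern, assuming a hypothetical 64 MB scratch buffer and omitting error checking:

#include <windows.h>

int main(void)
{
    SIZE_T size = 64 * 1024 * 1024;     /* hypothetical scratch buffer */

    /* Commit the scratch memory; committed pages start out zeroed by the OS. */
    void *buf = VirtualAlloc(NULL, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);

    /* ... use buf ... */

    /* Done with the contents, but we want to reuse the region later:
       MEM_RESET tells the memory manager it need not preserve (page out) the
       data. The documentation notes that flProtect must still be a valid
       value even though it is ignored for MEM_RESET. */
    VirtualAlloc(buf, size, MEM_RESET, PAGE_NOACCESS);

    /* ... later, simply write to buf again; no recommit is needed ... */

    VirtualFree(buf, 0, MEM_RELEASE);
    return 0;
}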
Similar functionality also exists on Linux under the MADV_FREE (or MADV_DONTNEED) flag to madvise. Glibc uses this in the implementation of its heap:
/*
 * Stack:
 * int shrink_heap (heap_info *h, long diff)
 * int heap_trim (heap_info *heap, size_t pad) at arena.c:660
 * void _int_free (mstate av, mchunkptr p, int have_lock) at malloc.c:4097
 * void __libc_free (void *mem) at malloc.c:2948
 * void free(void *mem)
 */
static int
shrink_heap (heap_info *h, long diff)
{
    long new_size;
    new_size = (long) h->size - diff;
    /* ... snip ... */
    __madvise ((char *) h + new_size, diff, MADV_DONTNEED);
    /* ... snip ... */
    h->size = new_size;
    return 0;
}
If your heap is in user space this will never work. The kernel can only trust itself, not user space. If the kernel zeros a page, it can treat it as zero. If user space says it zeroed a page, the kernel would still have to check that. It might just as well zero it. One thing user space can do is to discard pages. Which marks them as "don't care". Then a kernel can treat them as zero. But manually zeroing pages in user space is futile.
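For illustration, a small sketch of what discarding looks like from user space on Linux; the region size is arbitrary and error handling is omitted:

#define _DEFAULT_SOURCE
#include <sys/mman.h>
#include <string.h>

int main(void)
{
    size_t len = 16 * 4096;     /* some whole number of pages */

    /* Anonymous private mapping; pages are demand-zeroed on first touch. */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    memset(p, 0xAB, len);       /* dirty the pages */

    /* Tell the kernel we no longer care about the contents. The mapping stays
       valid, but the physical pages can be reclaimed; the next access after
       MADV_DONTNEED sees zero-filled pages again. */
    madvise(p, len, MADV_DONTNEED);

    munmap(p, len);
    return 0;
}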

Why does the speed of memcpy() drop dramatically every 4KB?

I tested the speed of memcpy(), noticing that the speed drops dramatically at i*4KB. The result is as follows: the Y-axis is the speed (MB/second) and the X-axis is the size of the buffer for memcpy(), increasing from 1KB to 2MB. Subfigure 2 and Subfigure 3 detail the ranges 1KB-150KB and 1KB-32KB.
Environment:
CPU : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
OS : 2.6.35-22-generic #33-Ubuntu
GCC compiler flags : -O3 -msse4 -DINTEL_SSE4 -Wall -std=c99
I guess it must be related to caches, but I can't find a reason from the following cache-unfriendly cases:
Why is my program slow when looping over exactly 8192 elements?
Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
The performance degradation in those two cases is caused by unfriendly loops which read scattered bytes into the cache, wasting the rest of the space of each cache line, so they don't seem to apply here.
Here is my code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

void memcpy_speed(unsigned long buf_size, unsigned long iters){
    struct timeval start, end;
    unsigned char * pbuff_1;
    unsigned char * pbuff_2;

    pbuff_1 = malloc(buf_size);
    pbuff_2 = malloc(buf_size);

    gettimeofday(&start, NULL);
    for(unsigned long i = 0; i < iters; ++i){
        memcpy(pbuff_2, pbuff_1, buf_size);
    }
    gettimeofday(&end, NULL);

    printf("%5.3f\n", ((buf_size*iters)/(1.024*1.024))/
           ((end.tv_sec - start.tv_sec)*1000*1000+(end.tv_usec - start.tv_usec)));

    free(pbuff_1);
    free(pbuff_2);
}
UPDATE
Considering suggestions from @usr, @ChrisW and @Leeor, I redid the test more precisely, and the graph below shows the results. The buffer size runs from 26KB to 38KB, and I tested it every 64B (26KB, 26KB+64B, 26KB+128B, ..., 38KB). Each test loops 100,000 times and takes about 0.15 seconds. The interesting thing is that the drop not only occurs exactly at 4KB boundaries, but also shows up at 4*i+2 KB, with a much smaller amplitude.
PS
@Leeor offered a way to fix the drop, adding a 2KB dummy buffer between pbuff_1 and pbuff_2. It works, but I am not sure about Leeor's explanation.
Memory is usually organized in 4k pages (although there's also support for larger sizes). The virtual address space your program sees may be contiguous, but that's not necessarily the case in physical memory. The OS, which maintains a mapping of virtual to physical addresses (in the page map), would usually try to keep the physical pages together as well, but that's not always possible and they may become fragmented (especially after long usage, where they may occasionally be swapped out).
When your memory stream crosses a 4k page boundary, the CPU needs to stop and go fetch a new translation - if it already saw the page, it may be cached in the TLB, and the access is optimized to be the fastest, but if this is the first access (or if you have too many pages for the TLBs to hold on to), the CPU will have to stall the memory access and start a page walk over the page map entries - that's relatively long as each level is in fact a memory read by itself (on virtual machines it's even longer as each level may need a full pagewalk on the host).
Your memcpy function may have another issue - when first allocating memory, the OS only builds the page-map entries for the pages, but marks them as unaccessed and unmodified due to internal optimizations. The first access may then not only invoke a page walk, but possibly also an assist telling the OS that the page is going to be used (and stored into, for the target buffer pages), which takes an expensive transition to some OS handler.
In order to eliminate this noise, allocate the buffers once, perform several repetitions of the copy, and calculate the amortized time. That, on the other hand, would give you "warm" performance (i.e. after having the caches warmed up), so you'll see the cache sizes reflected in your graphs. If you want a "cold" effect while not suffering from paging latencies, you might want to flush the caches between iterations (just make sure you don't time that).
EDIT
Reread the question, and you seem to be doing a correct measurement. The problem with my explanation is that it should show a gradual increase after 4k*i, since on every such drop you pay the penalty again, but then should enjoy the free ride until the next 4k. It doesn't explain why there are such "spikes" and after them the speed returns to normal.
I think you are facing a similar issue to the critical stride issue linked in your question - when your buffer size is a nice round 4k, both buffers will align to the same sets in the cache and thrash each other. Your L1 is 32k, so it doesn't seem like an issue at first, but assuming the data L1 has 8 ways it's in fact a 4k wrap-around to the same sets, and you have 2*4k blocks with the exact same alignment (assuming the allocation was done contiguously) so they overlap on the same sets. It's enough that the LRU doesn't work exactly as you expect and you'll keep having conflicts.
To check this, I'd try to malloc a dummy buffer between pbuff_1 and pbuff_2, make it 2k large, and hope that it breaks the alignment.
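Something along these lines inside memcpy_speed - a sketch of the padding experiment, with no guarantee that the allocator actually places the dummy block between the two buffers:

unsigned char *pbuff_1 = malloc(buf_size);
unsigned char *padding = malloc(2 * 1024);   /* dummy block; its only job is to shift pbuff_2 */
unsigned char *pbuff_2 = malloc(buf_size);

/* ... run the timed memcpy loop exactly as before ... */

free(pbuff_2);
free(padding);
free(pbuff_1);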
EDIT2:
Ok, since this works, it's time to elaborate a little. Say you assign two 4k arrays at ranges 0x1000-0x1fff and 0x2000-0x2fff. Set 0 in your L1 will contain the lines at 0x1000 and 0x2000, set 1 will contain 0x1040 and 0x2040, and so on. At these sizes you don't have any issue with thrashing yet; they can all coexist without overflowing the associativity of the cache. However, every time you perform an iteration you have a load and a store accessing the same set - I'm guessing this may cause a conflict in the HW. Worse - you'll need multiple iterations to copy a single line, meaning that you have a congestion of 8 loads + 8 stores (fewer if you vectorize, but still a lot), all directed at the same poor set; I'm pretty sure there are a bunch of collisions hiding there.
I also see that the Intel optimization guide has something to say specifically about that (see 3.6.8.2):
4-KByte memory aliasing occurs when the code accesses two different
memory locations with a 4-KByte offset between them. The 4-KByte
aliasing situation can manifest in a memory copy routine where the
addresses of the source buffer and destination buffer maintain a
constant offset and the constant offset happens to be a multiple of
the byte increment from one iteration to the next.
...
loads have to wait until stores have been retired before they can
continue. For example at offset 16, the load of the next iteration is
4-KByte aliased current iteration store, therefore the loop must wait
until the store operation completes, making the entire loop
serialized. The amount of time needed to wait decreases with larger
offset until offset of 96 resolves the issue (as there is no pending
stores by the time of the load with same address).
I expect it's because:
When the block size is a 4KB multiple, then malloc allocates new pages from the O/S.
When the block size is not a 4KB multiple, then malloc allocates a range from its (already allocated) heap.
When the pages are allocated from the O/S then they are 'cold': touching them for the first time is very expensive.
My guess is that, if you do a single memcpy before the first gettimeofday then that will 'warm' the allocated memory and you won't see this problem. Instead of doing an initial memcpy, even writing one byte into each allocated 4KB page might be enough to pre-warm the page.
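For example, a sketch of that pre-warm touch, using the buffers from the question and assuming 4KB pages:

/* Touch one byte in every 4KB page of both buffers before the timed section,
   so the first timed memcpy does not pay the page-fault / zero-fill cost. */
for (unsigned long off = 0; off < buf_size; off += 4096) {
    pbuff_1[off] = 0;
    pbuff_2[off] = 0;
}
gettimeofday(&start, NULL);
/* ... timed memcpy loop as before ... */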
Usually when I want a performance test like yours I code it as:
// Run it once to pre-warm the cache
runTest();
// Repeat and time
startTimer();
for (int i = count; i; --i)
    runTest();
stopTimer();
// use a larger count if the duration is less than a few seconds
// repeat the test 3 times to ensure that the results are consistent
Since you are looping many times, I think arguments about pages not being mapped are irrelevant. In my opinion what you are seeing is the effect of the hardware prefetcher not being willing to cross a page boundary, in order not to cause (potentially unnecessary) page faults.

Does mmap allocate heap memory contiguously?

Provided that:
The size I request is a multiple of the page size
The start address I request is the size + start address of the last allocation
If I always follow these rules when using mmap to allocate memory on the heap, will the addresses returned be contiguous? Or could there be gaps between them?
You can get the behavior you want with the MAP_FIXED flag. Unfortunately for your goal, it's not universally supported, so you'd want to check the return value to ensure that it gave you the allocation you requested. For good portability, you'd need a backup plan for when the call fails (mmap returns MAP_FAILED on error).
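A sketch of that approach with Linux-style flags (note that MAP_FIXED silently replaces any existing mapping in the target range, which is part of why the next answer recommends avoiding it):

#include <stddef.h>
#include <sys/mman.h>

/* Try to extend an existing anonymous mapping so the new block starts
   exactly where the previous one ends. Returns NULL on failure. */
void *map_after(void *prev, size_t prev_len, size_t new_len)
{
    void *want = (char *)prev + prev_len;    /* page-aligned if prev_len is */
    void *got = mmap(want, new_len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if (got == MAP_FAILED)
        return NULL;                         /* need a non-contiguous fallback */
    return got;                              /* with MAP_FIXED, got == want */
}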
Quick answer: not necessarily. There's a good chance it will "almost always work" in both limited and extensive testing on a variety of machines, but it's definitely not good practice. The MAP_FIXED flag is supported on most flavors of Linux, but it is also buggy in my experience. Avoid it.
Better in your case is to simply allocate everything you need at once, and then assign pointers manually to each sub-section of the mapping:
int LengthOf_FirstThing = 0x18000;
int LengthOf_SecondThing = 0x10100;
int LengthOf_ThirdThing = 0x20000;
int _pagesize = getpagesize();
int _pagemask = _pagesize - 1;
size_t sizeOfEverything = LengthOf_FirstThing + LengthOf_SecondThing + LengthOf_ThirdThing;
sizeOfEverything = (sizeOfEverything + _pagemask) & ~(_pagemask);
int8_t* result = (int8_t*)mmap(nullptr, sizeOfEverything, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
int8_t* myFirstThing = result;
int8_t* mySecondThing = myFirstThing + LengthOf_FirstThing;
int8_t* myThirdThing = mySecondThing + LengthOf_SecondThing;
An advantage of this approach is also that the things you're mapping don't each have to be strictly aligned to the page size. And most importantly, it assures fully contiguous memory.
Longer answer:
Implementations of mmap() can freely disregard the 'hint' address entirely, so you should never expect the address to be honored. This may be more common than expected, because some implementations may not actually support page-size granularity for new mmap()s. They may limit valid starting addresses to 16k or 64k boundaries to help reduce the overhead needed to manage very large virtual address spaces. Such an implementation would always disregard an mmap() hint that isn't aligned to such a boundary.
Additionally, mmap() does not allocate memory from the heap at all. The heap is an area of memory created/reserved by the C runtime libraries (glibc on *nix) when a process is created. malloc() and new/delete are typically the only functions that pull from the heap, along with any libraries that may use malloc/new internally. The heap itself is typically created and managed by calls to mmap() internally.
I think this is not specified but a so-called "implementation detail". I.e. you should not rely on one behaviour or the other, but assume that the pointer is opaque and not be concerned with its exact value.
(That said, there can be a place and time for hacks. In that case you need to find out exactly how your OS behaves.)

Allocating more memory to an existing Global memory array

Is it possible to add memory to a previously allocated array in global memory?
What I need to do is this:
// cudaMalloc memory for d_A
int n=0; int N=100;
do
{
    Kernel<<< , >>> (d_A, n++);
    // add N memory to d_A
} while (n != 5);
Does doing another cudaMalloc remove the values of the previously allocated array? In my case the values of the previously allocated array should be kept...
First, cudaMalloc behaves like malloc, not realloc. This means that cudaMalloc will allocate totally new device memory at a new location. There is no realloc function in the cuda API.
Secondly, as a workaround, you can just use cudaMalloc again to allocate more memory. Remember to free the device pointer with cudaFree before you assign a new address to d_A. The following code is functionally what you described.
int n=0; int N=100;
// set the initial memory size
size = <something>;
do
{
    // allocate just enough memory
    cudaMalloc((void**) &d_A, size);
    Kernel<<< ... >>> (d_A, n++);
    // free the memory allocated for d_A
    cudaFree(d_A);
    // increase the memory size
    size += N;
} while (n != 5);
Thirdly, cudaMalloc can be an expensive operation, and I expect the above code will be rather slow. I think you should consider why you want to grow the array. Can you allocate memory for d_A one time with enough memory for the largest use case? There is likely no reason to allocate only 100 bytes if you know you need 1,000 bytes later on!
// calculate the max memory requirement
MAX_SIZE = <something>;
// allocate only once
cudaMalloc((void**) &d_A, MAX_SIZE);
// use for loops when they are appropriate
for (n = 0; n < 5; n++)
{
    Kernel<<< ... >>> (d_A, n);
}
Your pseudocode does not "add memory to a previously allocated array" at all. The standard C way of increasing the size of an existing allocation is via the realloc() function, and there is no CUDA equivalent of realloc() at the time of writing.
When you do
cudaMalloc(d_A....)
// something
cudaMalloc(d_A....)
all you are doing is creating a new memory allocation and assigning it to d_A. The previous memory allocation still exists, but now you have lost the pointer value of the previous memory and have no way of accessing it. Based on this and your previous question on almost the same subject, might I suggest you spend a bit of time revising memory and pointer concepts in C before you try CUDA, because unless you have a very clear understanding of these fundamentals, you will find the distributed memory nature of CUDA very confusing.
I'm not sure what complications CUDA adds to the mix, but in C you can't add memory to an already allocated array.
If you want to grow a malloc'd array, you need to malloc a new array of the size you need and copy the contents from the existing array.
If you're doing this often then it's probably worth mallocing more than you need each time to avoid costly (in terms of processing time) re-allocation operations.
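Since there is no cudaRealloc, the device-side version of that grow-and-copy pattern looks roughly like this sketch (element type assumed to be int, error checking omitted):

#include <cuda_runtime.h>

/* "Reallocate" a device array: allocate a bigger buffer, copy the old
   contents device-to-device, free the old buffer, and hand back the new one. */
int *grow_device_array(int *d_old, size_t old_count, size_t new_count)
{
    int *d_new = NULL;
    cudaMalloc((void **)&d_new, new_count * sizeof(int));
    if (d_old != NULL) {
        cudaMemcpy(d_new, d_old, old_count * sizeof(int), cudaMemcpyDeviceToDevice);
        cudaFree(d_old);
    }
    return d_new;
}

As the answers above point out, doing this inside a loop is expensive, so it is usually better to allocate the maximum size once up front.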
