Why Garbage Collect in web apps? - memory-management

Consider building a web app on a platform where every request is handled by a user-level thread (ULT), i.e. a green thread, Erlang process, goroutine, or any other lightweight thread. Assume every request is stateless and that resources like DB connections are obtained at app startup and shared between these threads. What is the need for garbage collection within these threads?
Generally such a thread is short-running (a few milliseconds) and, if well designed, doesn't use more than a few KB or MB of memory. If garbage collection of the resources allocated in the thread were done at thread exit, independently of the other threads, then there would be no GC pauses even at the 98th or 99th percentile of requests. All requests would be answered in predictable time.
What is the problem with such a model, and why is it not widely used?

Your assumption might not be true:
"if well designed doesn't use more than a few (KB or MB) of memory"
Imagine a function for counting words in a text file that is used in a web app. A naive implementation could be:
def count_words(text):
    words = text.split()
    count = {}
    for w in words:
        if w in count:
            count[w] += 1
        else:
            count[w] = 1
    return count
It allocates more memory than the text itself: text.split() builds a list containing a separate string object for every word, and the dictionary adds its own entries on top of that.
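As a rough illustration (my own sketch, not from the original post), you can watch the peak allocation with tracemalloc; exact numbers vary by Python version, but the peak will comfortably exceed the size of the input string:

import tracemalloc

# roughly 1 MB of input text (made-up sample data)
text = "the quick brown fox jumps over the lazy dog " * 25_000

tracemalloc.start()
counts = count_words(text)                    # the function defined above
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"input size: {len(text)} bytes")
print(f"peak extra allocation inside count_words: {peak} bytes")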

Related

Does PyTorch allocate GPU memory eagerly?

Consider the following script:
import torch

def unnecessary_compute():
    x = torch.randn(1000, 1000, device='cuda')
    l = []
    for i in range(5):
        print(i, torch.cuda.memory_allocated())
        l.append(x**i)

unnecessary_compute()
Running this script with PyTorch (1.11) generates the following output:
0 4000256
1 8000512
2 12000768
3 16001024
4 20971520
Given that PyTorch uses asynchronous computation and we never evaluated the contents of l or of a tensor that depends on l, why did PyTorch eagerly allocate GPU memory to the new tensors? Is there a way of invoking these tensors in an utterly lazy way (i.e., without triggering GPU memory allocation before it is required)?
torch.cuda.memory_allocated() returns the memory that has been allocated, not the memory that has been "used".
In a typical GPU compute pipeline, you would record operations in a queue along with whatever synchronization primitives your API offers. The GPU will then dequeue and execute those operations, respecting the enqueued synchronization primitives. However, GPU memory allocation is not usually an operation which even goes on the queue. Rather, there's usually some sort of fundamental instruction that the CPU can issue to the GPU in order to allocate memory, just as recording operations is another fundamental instruction. This means that the memory necessary for a GPU operation has to be allocated before the operation has even been enqueued; there is no "allocate memory" operation in the queue to synchronize with.
Consider Vulkan as a simple example. Rendering operations are enqueued on a graphics queue. However, memory is typically allocated via calls to vkAllocateMemory(), which does not accept any sort of queue at all; it only accepts the device handle and information about the allocation (size, memory type, etc). From my understanding, the allocation is done "immediately" / synchronously (the memory is safe to use by the time the function call returns on the CPU).
I don't know enough about GPUs to explain why this is the case, but I'm sure there's a good reason. And perhaps the limitations vary from device to device. But if I were to guess, memory allocation probably has to be a fairly centralized operation; it can't be done by just any core executing recorded operations on a queue. This would make sense, at least; the space of GPU memory is usually shared across cores.
Let's apply this knowledge to answer your question: When you call l.append(x**i), you're trying to record a compute operation. That operation will require memory to store the result, and so PyTorch is likely allocating the memory prior to enqueuing the operation. This explains the behavior you're seeing.
However, this doesn't invalidate PyTorch's claims about asynchronous compute. The memory might be allocated synchronously, but it won't be populated with the result of the operation until the operation has been dequeued and completed by the GPU, which indeed happens asynchronously.
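A small sketch of that distinction (my own illustration, not part of the original answer): the result buffer shows up in memory_allocated() as soon as the operation is recorded, but its contents are only guaranteed to be there once the GPU has caught up, e.g. after torch.cuda.synchronize():

import torch

x = torch.randn(4000, 4000, device='cuda')
torch.cuda.synchronize()

before = torch.cuda.memory_allocated()
y = x @ x                      # enqueue the matmul; its output buffer is allocated now
after = torch.cuda.memory_allocated()
print("bytes allocated at enqueue time:", after - before)

torch.cuda.synchronize()       # wait for the kernel itself to finish
print("result available after sync:", y[0, 0].item())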
I was able to reproduce your problem. I cannot really tell you why it behaves like that; I just think the (randomly) initialized tensor needs a certain amount of memory. For instance, if you call x = torch.randn(0, 0, device='cuda'), the tensor does not allocate any GPU memory, while x = torch.zeros(1000, 1000, device='cuda') allocates 4000256 bytes, as in your example.
To load the tensors lazily, I suggest you create them on the CPU and send them to the GPU shortly before using them. It's a kind of speed/memory tradeoff. I changed your code accordingly:
import torch

def unnecessary_compute():
    x = torch.randn(1000, 1000, device='cpu')
    l = []
    for i in range(5):
        print(i, torch.cuda.memory_allocated())
        l.append(x**i)
    print("Move to cuda")
    for i, tensor_x in enumerate(l):
        l[i] = tensor_x.to('cuda')
        print(i, torch.cuda.memory_allocated())

unnecessary_compute()
that produced the following output:
0 0
1 0
2 0
3 0
4 0
Move to cuda
0 4000256
1 8000512
2 12000768
3 16001024
4 20971520

Faster memory allocation and freeing algorithm than multiple Free List method

We allocate and free many memory blocks. We use the memory heap, but heap access is costly.
For faster allocation and freeing, we adopt a global Free List. Since the program is multithreaded, the Free List is protected by a Critical Section. However, the Critical Section becomes a bottleneck for parallelism.
To remove the Critical Section, we assign a Free List to each thread, i.e. Thread Local Storage. However, thread T1 always allocates memory blocks and thread T2 always frees them, so the Free List in thread T2 keeps growing while providing no benefit.
Despite the Critical Section bottleneck, we adopt the Critical Section again, with a different method. We prepare several Free Lists, each with its own Critical Section: Free Lists 0..N-1 and Critical Sections 0..N-1. We also keep an atomically updated integer counter that cycles through 0, 1, 2, ..., N-1 and then back to 0. For each allocation and freeing, we read the counter value X, advance it, enter the X-th Critical Section, and access the X-th Free List. However, this is considerably slower than the previous method (Thread Local Storage): the atomic operation becomes quite slow as the number of threads grows.
Since mutating the counter non-atomically causes no corruption, we switched to a non-atomic update. However, because the counter value is sometimes stale, different threads often end up accessing the same Critical Section and Free List. This causes the bottleneck again, though less severely than before.
Instead of the counter, we then hashed the thread ID into the range 0..N-1, and performance improved.
I guess there must be a much better way of doing this, but I cannot find one. Are there any ideas for improving what we have made?
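For illustration only (my own sketch, not the asker's code), the thread-ID-hashing variant looks roughly like this in Python, with threading.Lock standing in for the Critical Sections:

import threading

class ShardedFreeList:
    # N Free Lists, each guarded by its own lock; the shard is chosen by
    # hashing the calling thread's ID into 0..N-1.

    def __init__(self, block_size, n_shards=8):
        self.block_size = block_size
        self.locks = [threading.Lock() for _ in range(n_shards)]
        self.free_lists = [[] for _ in range(n_shards)]

    def _shard(self):
        # hash the current thread's ID into the shard range
        return hash(threading.get_ident()) % len(self.locks)

    def allocate(self):
        i = self._shard()
        with self.locks[i]:
            if self.free_lists[i]:
                return self.free_lists[i].pop()
        # shard is empty: fall back to a fresh allocation from the heap
        return bytearray(self.block_size)

    def free(self, block):
        i = self._shard()
        with self.locks[i]:
            self.free_lists[i].append(block)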
Dealing with heap memory is a task for the OS. Nothing guarantees you can do a better/faster job than the OS does.
But there are some conditions where you can get a bit of improvement, especially when you know something about your memory usage that is unknown to the OS.
I'm writing down my untested idea here; I hope you get some benefit from it.
Let's say you have T threads, all of them reserving and freeing memory. The main goal is speed, so I'll try to avoid TLS, critical sections, and atomic ops.
If (repeat: if, if, if) the app can fit into several discrete memory block sizes (not random sizes, so as to avoid fragmentation and useless holes), then start by asking the OS for a number of blocks of these discrete sizes.
For example, you have an array of n1 blocks each of size size1, an array of n2 blocks each of size size2, an array of n3... and so on. Each array is two-dimensional; the second field just stores a used/free flag for the block. If your arrays are very large, it's better to use a dedicated array for the flags (contiguous memory access is always faster).
Now, someone asks for a block of memory of size sB. A specialized function (or object, or whatever) searches the array whose blocks are of size greater than or equal to sB, and then selects a block by looking at the used/free flags. Just before finishing this task, the corresponding block flag is set to "used".
When two or more threads ask for blocks of the same size, the flags may get corrupted. Using TLS would solve this issue, and so would a critical section. I think you can instead set a bool flag at the beginning of the search through the flags array, which makes the other threads wait until the flag changes, which only happens after the block flag changes. In pseudocode:
MemoryGetter(sB)
{
    //select which array depending on 'sB'
    for (i = 0; i < numOfArrays; i++)
        if (sizeOfArr(i) >= sB)
        {
            arrMatch = i
            break //exit for
        }

    //wait if another thread wants a block from the same arrMatch array
    while (searching(arrMatch) == true)
        ; //wait

    //block other threads wanting a block from the same arrMatch array
    searching(arrMatch) = true

    //get the first free block
    for (i = 0; i < numOfBlocks; i++)
        if (arrOfUsed(arrMatch, i) != true)
        {
            selectedBlock = addressOf(....)
            //mark the block as used
            arrOfUsed(arrMatch, i) = true
            break //exit for
        }

    //allow other threads
    searching(arrMatch) = false

    return selectedBlock //NOTE: selectedBlock == NULL means no free block
}
Freeing a block is easier: just mark it as free; there is no thread concurrency issue.
Dealing with no free blocks is up to you (wait, use a bigger block, ask OS for more, etc).
Note that the whole memory is reserved from the OS at app start, which can be a problem.
If this idea makes your app faster, let me know. What I can say for sure is that memory usage will be higher than with normal OS requests, but not by much if you choose "good" sizes, the ones used most often.
Some improvements can be done:
Cache the last freed block (per size) so as to avoid the search.
Start with fewer blocks, and ask the OS for more memory only when needed. Play with the number of blocks for each size depending on your app, and find the optimal case.

Python 3 multiprocessing: optimal chunk size

How do I find the optimal chunk size for multiprocessing.Pool instances?
I used this before to create a generator of n sudoku objects:
processes = multiprocessing.cpu_count()
worker_pool = multiprocessing.Pool(processes)
sudokus = worker_pool.imap_unordered(create_sudoku, range(n), n // processes + 1)
To measure the time, I use time.time() before the snippet above, then I initialize the pool as described, then I convert the generator into a list (list(sudokus)) to trigger generating the items (only for time measurement, I know this is nonsense in the final program), then I take the time using time.time() again and output the difference.
I observed that the chunk size of n // processes + 1 results in times of around 0.425 ms per object. But I also observed that the CPU is only fully loaded during the first half of the process; towards the end the usage goes down to 25% (on an i3 with 2 cores and hyper-threading).
If I use a smaller chunk size of int(l // (processes**2) + 1) instead, I get times of around 0.355 ms, and the CPU load is much better distributed. It just has some small spikes down to ca. 75%, but it stays high for a much longer part of the process time before going down to 25%.
Is there an even better formula for calculating the chunk size, or an otherwise better method for using the CPU most effectively? Please help me improve this multiprocessing pool's efficiency.
This answer provides a high-level overview.
Going into details, each worker is sent a chunk of chunksize tasks at a time for processing. Every time a worker completes that chunk, it needs to ask for more input via some type of inter-process communication (IPC), such as queue.Queue. Each IPC request requires a system call; due to the context switch, it costs anywhere in the range of 1-10 μs, let's say 10 μs. Due to shared caching, a context switch may hurt (to a limited extent) all cores. So, extremely pessimistically, let's estimate the maximum possible cost of an IPC request at 100 μs.
You want the IPC overhead to be immaterial, let's say <1%. You can ensure that by making chunk processing time >10 ms if my numbers are right. So if each task takes say 1 μs to process, you'd want chunksize of at least 10000.
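In code form, the same back-of-the-envelope calculation (the numbers are the assumptions from above, not measurements):

ipc_cost_s = 100e-6      # pessimistic cost of one IPC round trip
max_overhead = 0.01      # keep IPC below 1% of total time
task_time_s = 1e-6       # assumed processing time per task

min_chunk_time_s = ipc_cost_s / max_overhead           # 0.01 s = 10 ms
min_chunksize = int(min_chunk_time_s / task_time_s)    # 10000
print(min_chunk_time_s, min_chunksize)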
The main reason not to make chunksize arbitrarily large is that at the very end of the execution, one of the workers might still be running while everyone else has finished, unnecessarily increasing the time to completion. I suppose in most cases a delay of 10 ms is not a big deal, so my recommendation of targeting 10 ms chunk processing time seems safe.
Another reason a large chunksize might cause problems is that preparing the input may take time, wasting workers' capacity in the meantime. Presumably input preparation is faster than processing (otherwise it should be parallelized as well, using something like RxPY). So again, targeting a processing time of ~10 ms seems safe (assuming you don't mind a startup delay of under 10 ms).
Note: the context switches happen every ~1-20 ms or so for non-real-time processes on modern Linux/Windows - unless of course the process makes a system call earlier. So the overhead of context switches is no more than ~1% without system calls. Whatever overhead you're creating due to IPC is in addition to that.
Nothing will replace actual time measurements. I wouldn't bother with a formula; try constants such as 1, 10, 100, 1000, and 10000 instead and see what works best in your case.
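As a minimal sketch of that approach (my own code; create_sudoku here is just a stand-in for the real generator from the question):

import multiprocessing
import time

def create_sudoku(i):
    # placeholder workload; substitute the real sudoku generator here
    return sum(j * j for j in range(1000))

def benchmark(n=20_000):
    processes = multiprocessing.cpu_count()
    for chunksize in (1, 10, 100, 1000, n // processes + 1):
        with multiprocessing.Pool(processes) as pool:
            start = time.time()
            list(pool.imap_unordered(create_sudoku, range(n), chunksize))
            elapsed = time.time() - start
        print(f"chunksize={chunksize:>6}: {elapsed:.3f} s")

if __name__ == "__main__":
    benchmark()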

Implementing Stack and Queue with O(1/B)

This is an exercise from this textbook (page 77):
Exercise 48 (External memory stacks and queues). Design a stack data structure that needs O(1/B) I/Os per operation in the I/O model from Section 2.2. It suffices to keep two blocks in internal memory. What can happen in a naive implementation with only one block in memory? Adapt your data structure to implement FIFOs, again using two blocks of internal buffer memory. Implement deques using four buffer blocks.
I don't want the code. Can anyone explain to me what the question is asking, and how I can do operations in O(1/B) I/Os?
Quoting Section 2.2 of the book, on page 27:
External Memory: <...> There are special I/O operations that transfer B consecutive words between slow and fast memory. For example, the external memory could be a hard disk, M would then be the main memory size and B would be a block size that is a good compromise between low latency and high bandwidth. On current technology, M = 1 GByte and B = 1 MByte are realistic values. One I/O step would then be around 10 ms, which is 10^7 clock cycles of a 1 GHz machine. With another setting of the parameters M and B, we could model the smaller access time difference between a hardware cache and main memory.
So, doing things in O(1/B) most likely means, in other words, using only a constant number of these I/O operations for every B stack/queue operations, i.e. O(1/B) I/Os per operation amortized.
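As an illustration (my own sketch, not the book's reference solution), here is a stack that keeps up to two blocks of B elements in internal memory and touches "external memory" at most once every B operations, amortized:

class ExternalStack:
    # Toy model: `external` stands in for disk, and appending or popping a
    # whole block there counts as one I/O. Internal memory holds <= 2*B items.

    def __init__(self, B):
        self.B = B
        self.buffer = []        # internal memory, at most 2*B elements
        self.external = []      # list of full blocks of exactly B elements

    def push(self, x):
        self.buffer.append(x)
        if len(self.buffer) == 2 * self.B:
            # one I/O: write out the oldest B elements as a full block
            self.external.append(self.buffer[:self.B])
            self.buffer = self.buffer[self.B:]

    def pop(self):
        if not self.buffer and self.external:
            # one I/O: read back the most recently written block
            self.buffer = self.external.pop()
        return self.buffer.pop()

After every I/O the internal buffer holds exactly B elements, so at least B pushes or pops must happen before the next I/O, which gives the O(1/B) amortized bound. With only one block in memory, alternating push and pop right at a block boundary could force an I/O on every single operation, which is the pitfall the exercise hints at.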

Fast thread ordering algorithm without atomic CAS

I am looking for an approach that will let me assign ordinal numbers 0..(N-1) to N O/S threads, such that the threads are in numeric order: that is, the thread that gets ordinal 0 will have a lower O/S thread ID than the thread with ordinal 1.
In order to carry this out, the threads communicate via a shared memory space.
The memory ordering model is such that writes will be atomic (if two concurrent threads write a memory location at the same time, the result will be one or the other). The platform will not support atomic compare-and-set operations.
I am looking for an algorithm that is efficient in the number of writes to shared memory, will complete rapidly with up to tens of thousands of threads, and has no nasty worst-case thread arrival conditions.
The O/S will assign thread numbers in arbitrary order throughout a 32-bit space. There may be arbitrary thread creation delays - the algorithm can be considered complete when all N threads are present.
I am unable to use the obvious solution of collecting all the threads, and then having one thread sort them - without an atomic operation, I have no way of safely collecting all the individual threads (another thread could rewrite the slot).
With no claim to being optimal in any sense (there are clearly faster ways to do this with atomic compare-and-set operations, or as Martin indicated, atomic increment)...
Assuming N is known to all the threads, and each thread has a unique non-zero ID value, such as its stack address in 32-bit space...
Use an array of size N in shared space; ensure that this array is initialized to zero.
Each thread owns the first slot in the array that holds an ID lower than or equal to the thread's ID; the thread writes its ID there. This continues until the array is full of non-zero values, and all the values are in decreasing order.
At the completion of the algorithm, the index of the thread's slot in the array is its ordinal number.
If I have this right, you want to map Integer->Integer, where the input is an arbitrary 32-bit number and the output is a number from 0 to N, where N is the number of threads?
In that case, every time you create a new thread, call this method; the returned value is the ID:
integer nextId = 0;

integer GetInteger()
{
    return AtomicIncrement(nextId);
}
This algorithm is obviously O(N), assuming several things:
Threads never die.
You have some kind of atomic increment that increments a number and returns either the old value or the new value; the difference is just between 0-based and 1-based IDs.
Threads call the method asking what their ID is just once, when they're created.
