In my HPC application, I sometimes end up filling all the memory through dynamic allocation and lose performance. I am using MPI-Fortran. Is there a programming model/environment where the environment could spawn new processes when I am close to filling the memory? I can imagine that such a model would require me to explicitly specify how the problem will be partitioned. Has any such thing been implemented yet?
I am building an ML application for binary classification using ML.NET. It will have multiple ML models of varying sizes (built using different training data) which will be stored in a SQL Server database as blobs. Clients will send items for classification to this app in random order, and based on the client ID, the corresponding model is to be used for classification. To classify an item, the model needs to be read from the database and then loaded into memory. Loading a model into memory takes considerable time depending on its size, and I don't see any way to optimize that. Hence I am planning to cache models in memory. If I cache many heavy models, it may put pressure on memory and hamper the performance of other processes running on the server, and there is no straightforward way to limit the caching. So I am looking for suggestions on how to handle this.
Spawn a new process
In my opinion this is the only viable option to accomplish what you're trying to do. Spawn a completely new process that communicates (via IPC?) with your "main application". You could set a memory limit using this property https://learn.microsoft.com/en-us/dotnet/api/system.gcmemoryinfo.totalavailablememorybytes?view=net-5.0 or maybe even use a third-party library (e.g. https://github.com/lowleveldesign/process-governor) that kills your process if it reaches a specific amount of RAM. Both of these approaches are quite rough and will basically kill your process.
If you have control over your sidecar application while it is running, it might make sense to actually monitor its RAM usage with something like Getting a process's ram usage and stop the process gracefully.
Do-it-yourself solution (not recommended)
Basically there is no built-in way of limiting memory usage per thread or anything similar.
What counts towards the memory limit?
Shared resources
Since you have a running process, you need to define what exactly counts towards the memory limit. For example, if you have some static Dictionary that is manipulated by the running thread, what does it occupy? Only the difference between the old value and the new value? The whole new value? The key and the value?
There are many more cases like this that you'll have to take into consideration.
The actual measuring
You need some way to count the actual memory usage. This will probably be hard or near impossible to "implement":
Reference counting needed?
If you have a hostile thread, it might spawn an unbounded number of references to one object without ever using the new keyword. For each reference you'd have to count 32/64 bits.
What about built in types?
It might be "easy" to measure a byte[] included in your own type definition, but what about built in classes? If someone initializes a string with 100MB this might be an amount you need to keep track of.
... and many more ...
As you may have noticed from the previous examples, there is no easy definition of "RAM used by a thread". That is also why there is no easy way to get its value.
In my opinion it's insanely complex to do such a thing and needs a lot of definition work on your side. It might be feasible with lots of effort, but I'm not sure that is really what you want. Even if you manage to, what will you do about it? Just killing the thread might not clean up the resources.
Therefore I'd really think about having an OS-managed, independent process that you can kill whenever you need to.
How big are your models? Even large models (100 MB+) load pretty quickly off fast/SSD storage. I would consider caching them on fast drives/SSDs, because pulling them out of SQL Server is going to be much slower than reading from raw disk. See if this helps your performance.
If all my processors share the same memory, is using MPI useful at all, instead of going full OpenMP?
If you never intend to scale your application beyond a single shared-memory node, then OpenMP parallelisation might be relatively easier to implement than MPI parallelisation. Relatively, because the apparent simplicity of OpenMP is very misleading. In order to utilise the full ability of modern shared-memory machines, one should maximise data locality and use lots of private data, effectively treating them (the machines) as distributed-memory systems. Also, the most prevalent errors in shared-memory programming are data races, and those can at times be very hard to debug, even when armed with special thread-checker tools. Data races are virtually absent in MPI programming since processes do not share data.
That said, even when MPI processes communicate using shared memory, that is still slower than directly accessing the shared memory in a threaded process. Also, some algorithms require global data, which takes more memory with MPI, where each process has to hold its own copy of that data. This is curable in MPI-3.0 using shared-memory windows with one-sided operations, but that's somewhat cumbersome (though portable). There are also research efforts to reduce the intra-node communication overhead as much as possible, and some are quite successful.
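A minimal sketch of the MPI-3.0 shared-memory window approach mentioned above, assuming the C bindings and all ranks running on one node (the array size is just illustrative): only the lowest rank on the node actually allocates the global array, and the other ranks query a pointer to it instead of holding their own copies.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Group the ranks that can actually share memory (same node). */
    MPI_Comm nodecomm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodecomm);

    int noderank;
    MPI_Comm_rank(nodecomm, &noderank);

    /* Only rank 0 on the node allocates the "global" data; the rest
       allocate a zero-sized segment and query rank 0's pointer. */
    const MPI_Aint nelems = 1000;
    MPI_Aint winsize = (noderank == 0) ? nelems * sizeof(double) : 0;

    double *data;
    MPI_Win win;
    MPI_Win_allocate_shared(winsize, sizeof(double), MPI_INFO_NULL,
                            nodecomm, &data, &win);

    if (noderank != 0) {
        MPI_Aint qsize;
        int qdisp;
        MPI_Win_shared_query(win, 0, &qsize, &qdisp, &data);
    }

    /* Every rank on the node now reads the same physical memory
       through 'data' instead of holding a private copy. */
    MPI_Win_fence(0, win);
    if (noderank == 0)
        for (MPI_Aint i = 0; i < nelems; i++)
            data[i] = (double)i;
    MPI_Win_fence(0, win);

    printf("node rank %d sees data[42] = %f\n", noderank, data[42]);

    MPI_Win_free(&win);
    MPI_Comm_free(&nodecomm);
    MPI_Finalize();
    return 0;
}
```

The key point is that only one copy of the shared data exists per node, which is exactly the duplication the paragraph above is talking about.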
I worked on VxWorks 5.5 a long time back and it was the best experience working on the world's best real-time OS. Since then I have never had a chance to work on it again. But a question keeps popping into my mind: what makes it so fast and deterministic?
I have not been able to find many references for this question via Google.
So I just tried to think about what makes a regular OS non-deterministic:
Memory allocation/deallocation: Wikipedia says an RTOS uses fixed-size blocks, so that these blocks can be directly indexed, but this will cause internal fragmentation, and I am sure this is not at all desirable on mission-critical systems where memory is already limited.
Paging/segmentation: this is kind of linked to point 1.
Interrupt handling: I am not sure how VxWorks implements it, as this is something VxWorks handles very well.
Context switching: I believe that in VxWorks 5.5 all processes execute in the kernel address space, so a context switch involves just saving register values and nothing about the PCB (process control block), but I am still not 100% sure.
Process scheduling algorithms: if Windows implements preemptive scheduling (priority/round robin), will process scheduling be as fast as in VxWorks? I don't think so. So how does VxWorks handle scheduling?
Please correct my understanding wherever required.
I believe the following accounts for much of the difference:
No Paging/Swapping
A deterministic RTOS simply can't swap memory pages to disk. This would kill the determinism, since at any moment you could have to swap memory in or out.
vxWorks requires that your application fit entirely in RAM
No Processes
In vxWorks 5.5 there are tasks, but no processes like in Windows or Linux. Tasks are more akin to threads, and switching context between them is a relatively inexpensive operation. In Linux/Windows, switching processes is quite expensive.
Note that in vxWorks 6.x a process model was introduced, which adds some overhead, mainly related to transitioning from user mode to supervisor mode. The task switching time is not necessarily directly affected by the new model.
Fixed Priority
In vxWorks, the task priorities are set by the developer and are system-wide. The highest-priority ready task at any given time will be the one running. You can thus design your system to ensure that the task with the tightest deadline always executes before the others.
In Linux/Windows, generally speaking, while you have some control over the priority of processes, the scheduler will eventually let lower-priority processes run even if higher-priority processes are still active.
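To make the fixed-priority model concrete, here is a minimal sketch assuming the classic vxWorks 5.x taskLib API; the task names, priorities and stack sizes are made up for the example. As long as the high-priority task is ready to run, the lower-priority one never gets the CPU.

```c
#include <vxWorks.h>
#include <taskLib.h>
#include <stdio.h>

/* High-priority control loop. */
void controlLoop(void)
{
    for (;;)
    {
        /* ... read sensors, update actuators ... */
        taskDelay(1);           /* block until the next tick, letting lower
                                   priority tasks run in the meantime */
    }
}

/* Low-priority background logging task. */
void loggerTask(void)
{
    for (;;)
    {
        printf("logging...\n");
        taskDelay(60);
    }
}

void startTasks(void)
{
    /* In vxWorks a lower number means a higher priority, so the
       priority-50 task preempts the priority-200 task whenever it
       becomes ready. */
    taskSpawn("tCtrl", 50,  0, 4096, (FUNCPTR) controlLoop, 0,0,0,0,0,0,0,0,0,0);
    taskSpawn("tLog",  200, 0, 4096, (FUNCPTR) loggerTask,  0,0,0,0,0,0,0,0,0,0);
}
```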
How is memory allocated on slave nodes for the execution of MPI programs? How do slave nodes know the amount of memory to reserve? What happens when a slave node can't find the data that it wants to access?
This is not a homework problem, but a question that came up in my mind and that I couldn't find an answer to by googling.
With a non-specific question, the best answer you can expect will also be non-specific.
When programming with MPI you typically write a single program which is launched (via mpirun/mpiexec, or some batch system, e.g. Torque) on a set of nodes.
The master-slave model is but one approach.
Memory allocation is typically under program control: just as you would allocate memory as needed in any application, so too in your MPI program.
As to finding the data, it is often provided to the slaves (directly or indirectly) by the master process, if the master-slave model is used. If each MPI instance indeed has to "search" for the data it is to process, then, as with any program that is unable to find what it requires, it should send a suitable error message/status back to the caller (or the master process). A small sketch of this pattern follows.
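A minimal master/worker sketch in C (the chunk sizes and the message tag are illustrative assumptions, not from the question): the master decides what each worker receives, and each worker allocates only the memory its own piece needs, probing the incoming message first to learn its size.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* Master: hand each worker a differently sized chunk of work. */
        for (int w = 1; w < size; w++) {
            int n = 1000 * w;                     /* illustrative chunk size */
            double *chunk = malloc(n * sizeof(double));
            for (int i = 0; i < n; i++) chunk[i] = i;
            MPI_Send(chunk, n, MPI_DOUBLE, w, 0, MPI_COMM_WORLD);
            free(chunk);
        }
    } else {
        /* Worker: it does not know the size in advance, so it probes the
           message, allocates exactly that much memory, then receives. */
        MPI_Status st;
        int n;
        MPI_Probe(0, 0, MPI_COMM_WORLD, &st);
        MPI_Get_count(&st, MPI_DOUBLE, &n);

        double *buf = malloc(n * sizeof(double));
        MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        double sum = 0.0;
        for (int i = 0; i < n; i++) sum += buf[i];
        printf("worker %d processed %d elements, sum = %f\n", rank, n, sum);
        free(buf);
    }

    MPI_Finalize();
    return 0;
}
```

Note that this is one program launched on every node (e.g. mpirun -np 4 ./a.out); each rank allocates its own buffers under program control and branches on its rank to decide whether it acts as master or worker.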
I'm putting into production some RPGLE code which uses %alloc and dealloc to allocate and free memory. Programmers should be able to ensure there are no resulting memory leaks, but I'm worried about what happens if they don't.
My question is: if programmers mess up and there are memory leaks, when will this memory be reclaimed? Is it when the program leaves memory, or when the job finishes?
From the ILE RPG Programmer's Reference Guide:
Storage is implicitly freed when the activation group ends. Setting LR on will not free any heap storage allocated by the module, but any pointers to heap storage will be lost.
If your RPG program is in its own activation group, then the memory will be freed when the program ends. Of course, when your job ends, so does your activation group. So ending the job will always clean up any memory allocated.
It sounds like you are approaching RPG from a C/C++ background. I've been programming in RPG for about 8 years now and have only had to use the %alloc() BIF a handful of times.
That being said, if you are using a new activation group, you should be fine. If you are using a named activation group and you do not issue the RCLACTGRP command, or you are using the default activation group, you could run into issues.
Indeed, you have to study the mechanism of activation groups. Memory leaks may happen, but they will not do any damage to the machine (I love the AS/400). But you can harm the other programs within your iSeries job (remark: if you are not from an AS/400 background, you have to read about the AS/400 job mechanism).
If you start managing the activation groups within your job yourself (in the program, that is, of course), you can create separate, sort-of memory areas. It requires some overhead (you have to name the groups), but then you have a safe environment in which you can do powerful stuff.
I am not familiar with those built-in functions, but normally everything is cleaned up when the job ends (or the user logs off, if interactive). If you can't find an answer, I can point you to another community where your answer may be known.
I just happened to see this thread now, way late, but who knows, others out there might still find this useful.
%alloc and dealloc use the job's default heap, so that storage will be cleaned up when the job ends.
There is another type of heap, which you can use programmatically via the CEE APIs: user-defined heaps. This is the kind that I think you need to manage and clean up programmatically, because if you don't, I think it might cause memory leaks.