What are the differences between "__GFP_NOFAIL" and "__GFP_REPEAT"? - memory-management

The documentation (https://www.linuxjournal.com/article/6930) says:
Flag          Description
__GFP_REPEAT  The kernel repeats the allocation if it fails.
__GFP_NOFAIL  The kernel can repeat the allocation.
So, both of them may cause the kernel to repeat the allocation operation.
How can I choose between them?
What are the major differences?

That isn't really "documentation", but just an article on LinuxJournal. Granted, the author (Robert Love) is surely knowledgeable on the subject, but nonetheless those descriptions are quite imprecise and outdated (the article is from 2003).
The __GFP_REPEAT flag was renamed to __GFP_RETRY_MAYFAIL in kernel version 4.13 (see the relevant patchwork) and its semantics were also modified.
The original meaning of __GFP_REPEAT was (from include/linux/gfp.h kernel v4.12):
__GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
_might_ fail. This depends upon the particular VM implementation.
The name and semantics of this flag were somewhat unclear, and the new __GFP_RETRY_MAYFAIL flag has a much clearer name and description (from include/linux/gfp.h, kernel v5.7.2):
%__GFP_RETRY_MAYFAIL: The VM implementation will retry memory reclaim
procedures that have previously failed if there is some indication
that progress has been made elsewhere. It can wait for other
tasks to attempt high level approaches to freeing memory such as
compaction (which removes fragmentation) and page-out.
There is still a definite limit to the number of retries, but it is
a larger limit than with %__GFP_NORETRY.
Allocations with this flag may fail, but only when there is
genuinely little unused memory. While these allocations do not
directly trigger the OOM killer, their failure indicates that
the system is likely to need to use the OOM killer soon. The
caller must handle failure, but can reasonably do so by failing
a higher-level request, or completing it only in a much less
efficient manner.
If the allocation does fail, and the caller is in a position to
free some non-essential memory, doing so could benefit the system
as a whole.
As per __GFP_NOFAIL you can find a detailed description in the same file:
%__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
cannot handle allocation failures. The allocation could block
indefinitely but will never return with failure. Testing for
failure is pointless.
New users should be evaluated carefully (and the flag should be
used only when there is no reasonable failure policy) but it is
definitely preferable to use the flag rather than opencode endless
loop around allocator.
Using this flag for costly allocations is _highly_ discouraged.
In short, the difference between __GFP_RETRY_MAYFAIL and __GFP_NOFAIL is that the former will retry the allocation only a finite number of times before eventually reporting failure, while the latter will keep trying indefinitely until memory is available and will never report failure to the caller, because it assumes that the caller cannot handle allocation failure.
Needless to say, the __GFP_NOFAIL flag must be used with care, and only in scenarios in which no other option is feasible. It is useful in that it avoids explicitly calling the allocator in a loop until the request succeeds (e.g. while (!kmalloc(...));), and it is more efficient than such a loop.
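As a minimal sketch of how the two flags might be used in kernel code (the function, sizes and error handling here are illustrative, not taken from the question):

#include <linux/slab.h>

static int example_alloc(void)
{
        void *buf, *meta;

        /* __GFP_RETRY_MAYFAIL: retries harder than a plain GFP_KERNEL
         * allocation, but may still fail, so the result must be checked. */
        buf = kmalloc(4096, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
        if (!buf)
                return -ENOMEM; /* fail the higher-level request */

        /* __GFP_NOFAIL: never returns NULL, so testing for failure is
         * pointless, but the call may block indefinitely. */
        meta = kmalloc(64, GFP_KERNEL | __GFP_NOFAIL);

        kfree(meta);
        kfree(buf);
        return 0;
}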

Related

Can CPU Out-of-Order-Execution cause memory reordering?

I know store buffers and invalidate queues are causes of memory reordering. What I don't know is whether Out-of-Order-Execution can cause memory reordering.
In my opinion, Out-of-Order-Execution can't cause reordering because the results are always retired in-order as mentioned in this question.
To make my question clearer, let's say we have a relaxed memory consistency architecture like this:
It doesn't have store buffers or invalidate queues
It can do Out-of-Order-Execution
Can memory reordering still happen in this architecture?
Does a memory barrier have two functions: one forbidding out-of-order execution, the other flushing the invalidate queue and draining the store buffer?
Yes, out-of-order execution can definitely cause memory reordering, such as load/load reordering.
It is not so much a question of the loads being retired in order as of when the load value is bound to the load instruction. E.g., Load1 may precede Load2 in program order, but Load2 gets its value from memory before Load1 does; if there is an intervening store to the location read by Load2, then load/load reordering has occurred.
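A classic way to observe this is the message-passing litmus test. Below is a minimal C11 sketch (all names invented for the example; relaxed atomics are used so that neither the compiler nor the CPU is asked to order the accesses, and whether the reordering is actually observable depends on the hardware: x86 will not show it, ARM may). In practice you would run the two threads in a loop many times to catch the rare interleaving.

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

atomic_int data;
atomic_int flag;

void *writer(void *arg) {
    atomic_store_explicit(&data, 42, memory_order_relaxed);
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
    return NULL;
}

void *reader(void *arg) {
    /* Load1 (flag) precedes Load2 (data) in program order. */
    int f = atomic_load_explicit(&flag, memory_order_relaxed);
    int d = atomic_load_explicit(&data, memory_order_relaxed);
    /* If Load2 was satisfied before the writer's store to data, while
     * Load1 saw the later store to flag, we observe f == 1 and d == 0:
     * load/load reordering. */
    if (f == 1 && d == 0)
        printf("load/load reordering observed\n");
    return NULL;
}

int main(void) {
    pthread_t w, r;
    pthread_create(&w, NULL, writer, NULL);
    pthread_create(&r, NULL, reader, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return 0;
}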
However, certain systems, such as Intel P6 family systems, have additional mechanisms to detect such conditions to obtain stronger memory order models.
In these systems all loads are buffered until retirement, and if a store is detected to the location of such a buffered but not yet retired load, then the load and the instructions following it in program order are “nuked”, and execution is resumed at, e.g., Load2.
I call this Freye’s Rule snooping, after I learned that Brad Freye at IBM had invented it many years before I thought I had. I believe the standard academic reference is Gharachorloo.
I.e. it is not so much buffering loads until retirement, as it is providing such a detection and correction mechanism associated with buffering loads until retirement. Many CPUs provide buffering until retirement but do not provide this detection mechanism.
Note also that this requires something like snoop-based cache coherence. Many systems, including Intel systems that have such mechanisms, also support non-coherent memory, e.g. memory that may be cached but which is managed by software. If speculative loads are allowed to such cacheable but non-coherent memory regions, the Freye’s Rule mechanism will not work and memory will be weakly ordered.
Note: I said “buffer until retirement”, but if you think about it you can easily come up with ways of buffering not quite until retirement. E.g. you can stop this snooping when all earlier loads have themselves been bound, and there is no longer any possibility of an intervening store being observed even transitively.
This can be important, because there is quite a lot of performance to be gained by “early retirement”, removing instructions such as loads from buffering and repair mechanisms before all earlier instructions have retired. Early retirement can greatly reduce the cost of out of order hardware mechanisms.

V8 isolates mapped memory leaks

V8 developer is needed.
I've noticed that the following code leaks mapped memory (mmap, munmap); concretely, the number of mapped regions shown in cat /proc/<pid>/maps grows continuously and hits the system limit pretty quickly (/proc/sys/vm/max_map_count).
void f() {
  auto platform = v8::platform::CreateDefaultPlatform();
  v8::Isolate::CreateParams create_params;
  create_params.array_buffer_allocator =
      v8::ArrayBuffer::Allocator::NewDefaultAllocator();
  v8::V8::InitializePlatform(platform);
  v8::V8::Initialize();
  for (;;) {
    std::shared_ptr<v8::Isolate> isolate(v8::Isolate::New(create_params),
                                         [](v8::Isolate* i) { i->Dispose(); });
  }
  v8::V8::Dispose();
  v8::V8::ShutdownPlatform();
  delete platform;
  delete create_params.array_buffer_allocator;
}
I've played a little bit with the platform-linux.cc file and found that the UncommitRegion call just remaps the region with PROT_NONE but does not release it. Probably that's somehow related to the problem.
There are several reasons why we recreate isolates during the program execution.
The first one is that creating a new isolate and discarding the old one is more predictable in terms of GC. Basically, I found that doing
auto remoteOldIsolate = std::async(
    std::launch::async,
    [](decltype(this->_isolate) isolateToRemove) { isolateToRemove->Dispose(); },
    this->_isolate);
this->_isolate = v8::Isolate::New(cce::Isolate::_createParams);
is more predictable and faster than a call to LowMemoryNotification. So we monitor memory consumption using GetHeapStatistics and recreate the isolate when it hits the limit. It turns out we cannot treat GC activity as part of code execution; that leads to a bad user experience.
The second reason is that having an isolate per code allows us to run several codes in parallel; otherwise v8::Locker will block the second code for that particular isolate.
Looks like at this stage I have no choice and will rewrite the application to have a pool of isolates and a persistent context per code. Of course, this way code#1 may affect code#2 by doing many allocations, and GC will run on code#2 with no allocations at all, but at least it will not leak.
PS. I've mentioned that we use GetHeapStatistics for memory monitoring. I want to clarify that part a little bit.
In our case it's a big problem when the GC runs during code execution. Each code has an execution timeout (100-500 ms). GC activity during code execution blocks the code, and sometimes we get timeouts just for an assignment operation. GC callbacks don't give you enough accuracy, so we cannot rely on them.
What we actually do is specify --max-old-space-size=32000 (32 GB). That way the GC doesn't want to run, because it sees that plenty of memory exists. And using GetHeapStatistics (along with the isolate recreation I mentioned above) we do manual memory monitoring.
PPS. I also mentioned that sharing an isolate between codes may affect users.
Say you have user#1 and user#2. Each of them has their own code, and the two are unrelated. code#1 has a loop with tremendous memory allocation; code#2 is just an assignment operation. Chances are the GC will run during code#2 and user#2 will receive a timeout.
V8 developer is needed.
Please file a bug at crbug.com/v8/new. Note that this issue will probably be considered low priority; we generally assume that the number of Isolates per process remains reasonably small (i.e., not thousands or millions).
have a pool of isolates
Yes, that's probably the way to go. In particular, as you already wrote, you will need one Isolate per thread if you want to execute scripts in parallel.
this way code#1 may affect code#2 by doing many allocations, and GC will run on code#2 with no allocations at all
No, that can't happen. Only allocations trigger GC activity. Allocation-free code will spend zero time doing GC. Also (as we discussed before in your earlier question), GC activity is split into many tiny (typically sub-millisecond) steps (which in turn are triggered by allocations), so in particular a short-running bit of code will not encounter some huge GC pause.
sometimes we have timeouts just for assignment operation
That sounds surprising, and doesn't sound GC-related; I would bet that something else is going on, but I don't have a guess as to what that might be. Do you have a repro?
we specify --max-old-space-size=32000 (32 GB). That way the GC doesn't want to run, because it sees that plenty of memory exists. And using GetHeapStatistics (along with the isolate recreation I mentioned above) we do manual memory monitoring.
Have you tried not doing any of that? V8's GC is very finely tuned by default, and I would assume that side-stepping it in this way is causing more problems than it solves. Of course you can experiment with whatever you like; but if the resulting behavior isn't what you were hoping for, then my first suggestion is to just let V8 do its thing, and only interfere if you find that the default behavior is somehow unsatisfactory.
code#1 has a loop with tremendous memory allocation; code#2 is just an assignment operation. Chances are the GC will run during code#2 and user#2 will receive a timeout.
Again: no. Code that doesn't allocate will not be interrupted by GC. And several functions in the same Isolate can never run in parallel; only one thread may be active in one Isolate at the same time.

Alternatives to dynamic allocations in safety critical projects (C)

Safety-critical projects do not recommend dynamic allocation or freeing of allocated memory; it is allowed only during the elaboration/initialization phase of program execution.
I know most of you will argue that the software should use static allocation only, or that the code should carry some justification that dynamic allocation will not harm the overall program, etc. But still: is there an alternative? Is there a way, or an example, of allocating some (heap) memory during program initialization/elaboration and then allocating/deallocating memory from there? Or any solutions/alternatives to this problem if we really want dynamic allocation in a (safety-critical) project?
This type of question is asked most often by developers who want to be able to use dynamic memory allocation within a safety-related system without "undue" restrictions - which quite often seems to mean they are not prevented from dynamically allocating memory in amounts they choose, when they choose, and (possibly) releasing that memory when they choose.
I'll address that question (can dynamic memory allocation be used in a critical system without restrictions?) first. Then I'll come back to options involving accepting some restrictions on how (when, or if) dynamic memory allocation is used.
Within a "safety critical project", such a thing is generally not possible. Safety related systems generally have mandatory requirements concerned with mitigating or eliminating specified hazards. Failure to adequately mitigate or eliminate specified hazards (i.e. to meet the requirements) can result in harm - for example, death or injury of people. In such systems, it is generally necessary to determine, to some level of rigour, that the hazards are appropriately and reliably mitigated or eliminated. A consequence of this is typically a set of requirements related to determinism - the ability to determine, through appropriate analysis, that the system completes actions in a specified manner - where attributes like behaviour and timing are tightly specified.
If dynamic memory allocation is used without restriction, it is difficult to determine whether parts of the system behave as required. Types of problems include:
Fragmentation of unallocated memory. It is not possible to ensure that a request to allocate N contiguous bytes of memory will succeed, even if N bytes of memory are available. This is particularly true if there have previously been multiple allocations and deallocations in arbitrary order - even if N bytes of memory are available, they may not be in a contiguous parcel.
Sufficiency. It is often difficult to provide an assurance that a critical memory allocation, which must succeed, does actually succeed.
Appropriate release. It is difficult to prevent memory being released while it is still needed (resulting in the potential to access memory that has been deallocated) or to ensure memory that is no longer needed is actually released (e.g. to prevent memory leaks).
Timeliness. Attempts to mitigate the preceding problems mean that the time of an allocation or of a deallocation is variable and unpredictable, with potentially no upper bound. Examples of approaches to deal with these are defragmentation (to deal with problems of fragmentation) or garbage collection (to deal with problems of sufficiency and/or appropriate release). These processes take time and other system resources. If they are done when attempting an allocation, the time to allocate memory becomes unpredictable. If they are done on releasing memory, the time to release memory becomes unpredictable. If they are done at other times, the behaviour of other - potentially critical - code may become unpredictable (e.g. the world effectively freezes for the application).
All of these factors, and more, mean that unrestricted dynamic memory allocation does not work well within requirements for determinism of timing or resource usage of the system. Inherently, system requirements require some restrictions to be imposed and, depending on the system, enforced.
If restrictions on dynamic memory allocation are acceptable, there are options. Generally, these techniques require support both in terms of policy constraints and technical solutions to encourage (and preferably enforce, in high-criticality systems) compliance with those policies. Policy enforcement may be technical (e.g. automated and manual design and code reviews, tailored development environments, compliance testing, etc.) or organisational (e.g. dismissing developers who willfully work around key policies).
Examples of techniques include:
No dynamic allocation at all; i.e., static allocation only.
Only use dynamic memory allocation during system initialisation. This requires the maximum amount of memory that needs to be allocated to be determined in advance. If memory allocation fails, treat it like any POST (power-on-self-test) failure.
Allocate memory but never release it. This tends to avoid problems of fragmentation, but can make it more difficult to determine an upper bound on how much memory is needed by the system.
Custom allocation. The system (or application) explicitly manages dynamic memory allocation, rather than using generic library functions (e.g. those associated with the programming language of choice). This usually means introducing a custom allocator and forbidding (or disabling) use of generic library functions for dynamic memory management. The custom allocator must be explicitly engineered with needs of the particular system in mind.
Boxing in memory management. This is a particular type of custom allocation, where the application allocates a pool of memory and functions request fixed amounts (or multiples of a fixed amount) from the pool. Because the pool is fixed by the application, the application can monitor how much memory from the pool is in use, and take action to release memory if the pool is exhausted. Allocations and deallocations from the pool can also be performed predictably (because some of the more general concerns with dynamic memory allocation are being managed); a sketch of such a pool follows this list. Critical systems may have multiple pools, each for exclusive use by specific sets of functions.
Partitioning. Explicitly prevent non-critical functions from accessing memory pools that have been established for use by critical functions. This allows an assurance that critical functions can access the memory they need, and also helps ensure that failure of a low-criticality function cannot trigger failure of a high-criticality function. Partitioning may be performed within an application, or within an (appropriately certified) host operating system, or both, depending on the needs of the system.
Some of these approaches can be used to support each other.
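As an illustration of the fixed-block pool ("boxing") technique mentioned above, here is a minimal sketch in C; the names and sizes are invented for the example. Both pool_alloc and pool_free run in constant time, which addresses the timeliness concern, and fixed-size blocks make fragmentation impossible:

#include <stddef.h>
#include <stdalign.h>

#define BLOCK_SIZE  64   /* must be a multiple of the pointer alignment */
#define BLOCK_COUNT 128

/* Statically allocated backing store; no heap use at all. */
static alignas(max_align_t) unsigned char pool[BLOCK_COUNT][BLOCK_SIZE];
static void *free_list;

/* Called once during initialization: chain every block into a
 * singly linked free list. */
void pool_init(void)
{
    free_list = NULL;
    for (size_t i = 0; i < BLOCK_COUNT; i++) {
        *(void **)pool[i] = free_list;
        free_list = pool[i];
    }
}

/* O(1): pop the head of the free list; returns NULL if exhausted. */
void *pool_alloc(void)
{
    void *block = free_list;
    if (block)
        free_list = *(void **)block;
    return block;
}

/* O(1): push a block (previously returned by pool_alloc) back. */
void pool_free(void *block)
{
    *(void **)block = free_list;
    free_list = block;
}

A real system would add concurrency protection and exhaustion monitoring; the point here is only that allocation and release times are bounded and predictable.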

Is Visibility Problem in Java caused by JVM or Hardware?

Previously I thought the visibility problem was caused by the CPU cache, for performance.
But I saw this article: http://www.ibm.com/developerworks/java/library/j-5things15/index.html
In paragraph 3, "Volatile variables", it says that each thread holds its own cache, which sounds like the caching is done by the JVM.
What's the answer? JVM or Hardware?
The JVM gives you some weak guarantees. The compiler and hardware cause you problems. :-)
When a thread reads a variable, it is not necessarily getting the latest value from memory. The processor might return a cached value. Additionally, even though the programmer authored code where a variable is first written and later read, the compiler might reorder the statements as long as it does not change the program semantics. It is quite common for processors and compilers to do this for performance optimization. As a result, a thread might not see the values it expects to see. This can result in hard-to-fix bugs in concurrent programs.
Most programmers are familiar with the fact that entering a synchronized block means obtaining a lock on a monitor that ensures that no other thread can enter the synchronized block. Less familiar but equally important are the facts that
(1) Acquiring a lock and entering a synchronized block forces the thread to refresh data from memory.
(2) Upon exiting the synchronized block, data written is flushed to memory.
http://www.javacodegeeks.com/2011/02/java-memory-model-quick-overview-and.html
See also JSR 133 (Java Memory Model and Thread Specification Revision) http://jcp.org/en/jsr/detail?id=133 It was released with JDK 1.5.
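Those two facts are essentially acquire/release semantics, and they exist below the Java level too. As a cross-language analogy (this is C11, not Java, and not the JVM's actual implementation), the same contract can be sketched with atomics:

#include <stdatomic.h>
#include <pthread.h>

int shared_data;     /* plain data, published via the flag */
atomic_int ready;

void *producer(void *arg) {
    shared_data = 42;
    /* "Exiting the synchronized block": the release store makes every
     * write before it visible to a matching acquire load. */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

void *consumer(void *arg) {
    /* "Entering the synchronized block": the acquire load forces this
     * thread to see everything published before the release store. */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;   /* spin until the producer publishes */
    /* shared_data is guaranteed to be 42 here. */
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}

Entering a synchronized block corresponds to the acquire side and exiting it to the release side; a volatile write/read pair in Java gives a similar ordering guarantee.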

Best way to handle malloc failure in Cocoa

Although it won't happen often, there are a couple of cases where my Cocoa application will allocate very large amounts of memory, enough to make me worry about malloc failing. What is the best way to handle this sort of failure in a Cocoa application? I've heard that exceptions are generally discouraged in this development environment, but is this a case where they would be useful?
If you have an allocation fail because you are out of memory, more likely than not there has been an allocation error in some framework somewhere that has left the app in an undetermined state.
Even if that isn't the case, you can't do anything that'll allocate memory, and that leaves you with very few options.
Even freeing memory in an attempt to "fix" the problem isn't going to consistently work, not even to "fix" it by showing a nice error message and exiting cleanly.
You also don't want to try and save data from this state. Or, at least, not without writing all the code necessary to deal with corrupt data on read (because it is quite possible that a failed allocation meant some code somewhere corrupted memory).
Treat allocation failures as fatal, log and exit.
It is extremely uncommon for a correctly written application to run out of memory. More likely, when an app runs out of memory, the user's system is going to be paging like hell, and thus performance will have degraded significantly long before the allocation failure.
Your return on investment for focusing on optimizing and reducing memory use will be orders of magnitude greater than trying to recover from an allocation failure.
(Alan's original answer was accurate as well as his edit).
If you're running into memory allocation errors, you shouldn't try to handle them, and instead rethink how your application uses memory.
I'm not sure what the Cocoa idioms are, but for C++ and C# at least, out of memory exceptions are a sign of larger problems and are best left to the user/OS to deal with.
Say your memory allocation fails: what else can your system do? How much memory is left? Is it enough to show a dialog/print a message before shutting down? Will throwing an exception succeed? Will cleaning up resources cause cascading memory exceptions?
If malloc fails, you will get NULL back, so if that's the case, can your application continue without the memory? If not, treat the condition as a fatal error and exit with a helpful message to the user.
If you run out of memory, there is usually not much you can do short of terminating your app. Even showing a notification could fail because there is not enough memory for it.
The standard in C applications is to write a void *xmalloc(size_t size); function that checks the return value of malloc and, if it is NULL, prints an error to stderr and then calls abort(). That way you just use xmalloc throughout your code and don't think about it. If you run out of memory, bad luck: your app will die.
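A minimal sketch of that idiom (the error message wording is illustrative):

#include <stdio.h>
#include <stdlib.h>

/* malloc wrapper that never returns NULL: on failure it reports
 * the error and aborts the process. */
void *xmalloc(size_t size)
{
    void *p = malloc(size);
    if (p == NULL) {
        fprintf(stderr, "fatal: out of memory (requested %zu bytes)\n", size);
        abort();
    }
    return p;
}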
