ArrayList as a n implementation of List in Java 8 - java-8

You use a java.util.ArrayList as the implementation for a java.util.List collection using:
List L = new ArrayList();
Referring to the scenario above, what happens when you add an element that exceeds the ArrayList's capacity?

There are several scenarios to your question:
You have enough memory to hold both the original array and a larger copy
Then it's the usual scenario, the capacity is automatically increased and you can add more elements.
You have few heap memory at your disposal.
In that case, you'll get an OutOfMemoryError when ArrayList tries to create a copy of its internal array with a bigger capacity.
You already have a capacity around Integer.MAX_VALUE - 8 and you have enough heap memory to create an even bigger array.
Then, you'll get an OutOfMemoryError if you're using a Hotspot/OpenJDK JVM. It's not documented anywhere in the List interface or their implementations, but according to tests Holger realized (see comments), this is what you get because creating an array of size Integer.MAX_VALUE throws such error. Other JVMs might throw other exceptions or errors.

ArrayList's capacity depends on available heap memory.
In case no more heap space is available when calling ArrayList.add, an OutOfMemoryError will be thrown

Related

Ignite uses more memory than expected

I am using Ignite to build a framework for data calculation. One big problem is the memory usage is a little more than expected. The data using 1G memory outside Ignite will use more than 1.5G in Ignite cache.
I turned off backup and copyOnRead already. I don't use query feature so no extra index space. I also counted in the extra space used for each cache and cache entry. The total memory usages still doesn't add up.
The data value for each cache entry is a big map contains list of primitive arrays. Each entry is about 120MB.
What can be the problem? The data structure or the configuration?
Ignite does introduce some overhead to your data and half of a GB doesn't sound too bad too me. I would recommend you to refer to this guide for more details: https://apacheignite.readme.io/docs/capacity-planning
Difference between expected and real memory usage arises from 2 main points:
Each entry takes constant overhead consists of objects providing support for processing entries in distributed computing environment.
E.g. you can declare integer local variable, it takes 4 bytes in the stack, but it's hard to make the variable long live and accessible from other places of program. So you have to create new Integer object, which consumes at least 16 bytes (300% overhead isn't it?). Going further, if you want to make this object mutable and safely acsessible by multiple threads, you have to create new AtomicReference and store your object inside. Total memory consumption will be at least 32 bytes... and so on. Every time we're extending object functionality, we get additional overhead, there is no other way.
Each entry stored inside a cache in a special serialized format. So the actual memory footprint of an entry depends on the format is used. By default Ignite uses BinaryMarshaller to convert an object to the byte array, and this array is stored inside a BinaryObject.
The reason is simple, distributed computing systems continiously exchange entries between nodes, and every entry in cache should be ready to be transferred as a byte array.
Please, read the article, it was recently updated. You could estimate entry overhead for small entries by hand, but for big entries you should inspect actual entry stored in the cache as a byte array. Look at the withKeepBinary method.

Gradle heap error on uploadArchives

I am trying to upload an archive that's 600MB in size.
I get this error:
Execution failed for task ':uploadArchives'.
> Java heap space
...
...
Caused by: java.lang.OutOfMemoryError: Java heap space
I have tried to set GRADLE_OPTS, JVM_OPTS and MAVEN_OPTS variables, for setting the max. heap size, like for example:
export GRADLE_OPTS=-Xmx1024m
gradle uploadArchives
But I am still getting the same error.
What am I missing here?
Ultimately you always have a finite max of heap to use no matter what platform you are running on. In Windows 32 bit this is around 2gb (not specifically heap but total amount of memory per process). It just happens that Java happens to make the default smaller (presumably so that the programmer can't create programs that have runaway memory allocation without running into this problem and having to examine exactly what they are doing).
So this given there are several approaches you could take to either determine what amount of memory you need or to reduce the amount of memory you are using. One common mistake with garbage collected languages such as Java or C# is to keep around references to objects that you no longer are using, or allocating many objects when you could reuse them instead. As long as objects have a reference to them they will continue to use heap space as the garbage collector will not delete them.
In this case you can use a Java memory profiler to determine what methods in your program are allocating large number of objects and then determine if there is a way to make sure they are no longer referenced, or to not allocate them in the first place. One option which I have used in the past is "JMP" http://www.khelekore.org/jmp/.
If you determine that you are allocating these objects for a reason and you need to keep around references (depending on what you are doing this might be the case), you will just need to increase the max heap size when you start the program. However, once you do the memory profiling and understand how your objects are getting allocated you should have a better idea about how much memory you need.
In general if you can't guarantee that your program will run in some finite amount of memory (perhaps depending on input size) you will always run into this problem. Only after exhausting all of this will you need to look into caching objects out to disk etc. At this point you should have a very good reason to say "I need Xgb of memory" for something and you can't work around it by improving your algorithms or memory allocation patterns. Generally this will only usually be the case for algorithms operating on large datasets (like a database or some scientific analysis program) and then techniques like caching and memory mapped IO become useful.
Run Java with the command-line option -Xmx, which sets the maximum size of the heap.
http://docs.oracle.com/javase/7/docs/technotes/tools/windows/java.html#nonstandard
The problem was because the actual size of package was a lot higher due to gradle not being able to handle symlinks.
When I manually handled the symlinks, the problem ended.
The JVM heap size can be set in the gradle.properties file, in the root directory of your gradle project. Like this:
org.gradle.jvmargs=-Xms256m -Xmx1024m

How to see hadoop's heap use?

I am doing a school work to analyze the use of heap in hadoop. It involves running two versions of a mapreduce program to calculate the median of the length of forum comments: the first one is 'memory-unconscious' and the reduce program handles in memory a list with the length of every comment; the second one is 'memory-conscious' and the reducer uses a very memory-efficient data structure to handle the data.
The purpose is to use both programs to process data of different sizes and watch how the memory usage goes up faster in the first one (until it eventually runs out of memory).
My question is: how can I obtain the heap usage of hadoop or the reduce tasks?
I thouth the counter "Total committed heap usage (bytes)" would cointain this data, but I have found both versions of the program return almost the same values.
Regarding the correctness of the programs, the 'memory-unconscious' one runs out of memory with a large input and fails, while the other one does not and is able to finish.
Thanks in advance
I don't know what memory-conscious data structure you are using(If you give which one then might help), But most of in-memory data structure utilizes virtual memory means is data structure size increases to some extent based on policy extra data element/s will be dump into virtual memory. Hence we does not result in Out-of-memory error. but in case memory-unconscious doesn't do that. In both the cases data structure size will remain same that's why you are getting same size. To get real memory usage by Reducer you can get it by:
New Feature added java 1.5 is Instrumentation interface by which you can get objects memory usage(getObjectSize). Nice article about it: LINK
/* Returns the amount of free memory in the Java Virtual Machine. Calling the gc method may result in increasing the value returned by freeMemory.*/
long freeMemory = Runtime.getRuntime().freeMemory()
/* Returns the maximum amount of memory that the Java virtual machine will attempt to use. If there is no inherent limit then the value Long.MAX_VALUE will be returned. */
long maximumMemory = Runtime.getRuntime().maxMemory();
/* Returns the total amount of memory in the Java virtual machine. The value returned by this method may vary over time, depending on the host environment.
Note that the amount of memory required to hold an object of any given type may be implementation-dependent. */
long totalMemory = Runtime.getRuntime().totalMemory()

Garbage collection and memory management in Erlang

I want to know technical details about garbage collection (GC) and memory management in Erlang/OTP.
But, I cannot find on erlang.org and its documents.
I have found some articles online which talk about GC in a very general manner, such as what garbage collection algorithm is used.
To classify things, lets define the memory layout and then talk about how GC works.
Memory Layout
In Erlang, each thread of execution is called a process. Each process has its own memory and that memory layout consists of three parts: Process Control Block, Stack and Heap.
PCB: Process Control Block holds information like process identifier (PID), current status (running, waiting), its registered name, and other such info.
Stack: It is a downward growing memory area which holds incoming and outgoing parameters, return addresses, local variables and temporary spaces for evaluating expressions.
Heap: It is an upward growing memory area which holds process mailbox messages and compound terms. Binary terms which are larger than 64 bytes are NOT stored in process private heap. They are stored in a large Shared Heap which is accessible by all processes.
Garbage Collection
Currently Erlang uses a Generational garbage collection that runs inside each Erlang process private heap independently, and also a Reference Counting garbage collection occurs for global shared heap.
Private Heap GC: It is generational, so divides the heap into two segments: young and old generations. Also there are two strategies for collecting; Generational (Minor) and Fullsweep (Major). The generational GC just collects the young heap, but fullsweep collect both young and old heap.
Shared Heap GC: It is reference counting. Each object in shared heap (Refc) has a counter of references to it held by other objects (ProcBin) which are stored inside private heap of Erlang processes. If an object's reference counter reaches zero, the object has become inaccessible and will be destroyed.
To get more details and performance hints, just look at my article which is the source of the answer: Erlang Garbage Collection Details and Why It Matters
A reference paper for the algorithm: One Pass Real-Time Generational Mark-Sweep Garbage Collection (1995) by Joe Armstrong and Robert Virding in
1995 (at CiteSeerX)
Abstract:
Traditional mark-sweep garbage collection algorithms do not allow reclamation of data until the mark phase of the algorithm has terminated. For the class of languages in which destructive operations are not allowed we can arrange that all pointers in the heap always point backwards towards "older" data. In this paper we present a simple scheme for reclaiming data for such language classes with a single pass mark-sweep collector. We also show how the simple scheme can be modified so that the collection can be done in an incremental manner (making it suitable for real-time collection). Following this we show how the collector can be modified for generational garbage collection, and finally how the scheme can be used for a language with concurrent processes.1
Erlang has a few properties that make GC actually pretty easy.
1 - Every variable is immutable, so a variable can never point to a value that was created after it.
2 - Values are copied between Erlang processes, so the memory referenced in a process is almost always completely isolated.
Both of these (especially the latter) significantly limit the amount of the heap that the GC has to scan during a collection.
Erlang uses a copying GC. During a GC, the process is stopped then the live pointers are copied from the from-space to the to-space. I forget the exact percentages, but the heap will be increased if something like only 25% of the heap can be collected during a collection, and it will be decreased if 75% of the process heap can be collected. A collection is triggered when a process's heap becomes full.
The only exception is when it comes to large values that are sent to another process. These will be copied into a shared space and are reference counted. When a reference to a shared object is collected the count is decreased, when that count is 0 the object is freed. No attempts are made to handle fragmentation in the shared heap.
One interesting consequence of this is, for a shared object, the size of the shared object does not contribute to the calculated size of a process's heap, only the size of the reference does. That means, if you have a lot of large shared objects, your VM could run out of memory before a GC is triggered.
Most if this is taken from the talk Jesper Wilhelmsson gave at EUC2012.
I don't know your background, but apart from the paper already pointed out by jj1bdx you can also give a chance to Jesper Wilhelmsson thesis.
BTW, if you want to monitor memory usage in Erlang to compare it to e.g. C++ you can check out:
Erlang Instrument Module
Erlang OS_MON Application
Hope this helps!

CUDA Stream compaction: understanding the concept

I am using CUDA/Thrust/CUDPP. As I understand, in Stream compaction, certain items in an array are marked as invalid and then "removed".
Now what does "removal" really mean here? Suppose the original array A and has length 6. If 2 elements are invalid (by whatever condition we may provide) then
Does the system create a new array of size 4 in GPU-memory to store the valid elements to get the final result?
OR does it physically remove the invalid elements from memory and shrink the original array
A down to size 4 keeping only the valid elements?
For either case, doesn't that mean that dynamic memory allocation is happening under the hood?
But I had heard that dynamic memory allocation is not possible in the CUDA world.
First, dynamic memory allocation is possible in CUDA on Compute Capability 2.0 and higher devices. The CUDA runtime library supports malloc/free and new/delete in __device__ functions. But that is not germane to the answer, really.
Typically a large-enough output array is provided (pre-allocated, often the same size as the input array) and the output is written to it. No dynamic allocation required, but there is potentially storage waste. This is what CUDPP and thrust do. An alternative would be to perform a count of valid elements first, then allocate the output GPU memory dynamically using cudaMalloc called from the host CPU.

Resources