I am using Ignite to build a framework for data calculation. One big problem is that the memory usage is higher than expected: data that takes 1 GB of memory outside Ignite uses more than 1.5 GB in the Ignite cache.
I have already turned off backups and copyOnRead. I don't use the query feature, so there is no extra index space. I also accounted for the extra space used for each cache and cache entry. The total memory usage still doesn't add up.
The value of each cache entry is a big map containing lists of primitive arrays. Each entry is about 120 MB.
What can be the problem? The data structure or the configuration?
Ignite does introduce some overhead on top of your data, and half a gigabyte doesn't sound too bad to me. I would recommend referring to this guide for more details: https://apacheignite.readme.io/docs/capacity-planning
The difference between expected and real memory usage arises from two main points:
Each entry carries a constant overhead, consisting of the objects that support processing entries in a distributed computing environment.
E.g. you can declare an integer local variable; it takes 4 bytes on the stack, but it's hard to make that variable long-lived and accessible from other parts of the program. So you have to create a new Integer object, which consumes at least 16 bytes (a 300% overhead, isn't it?). Going further, if you want to make this object mutable and safely accessible by multiple threads, you have to create a new AtomicReference and store your object inside it. Total memory consumption is now at least 32 bytes... and so on. Every time we extend an object's functionality, we get additional overhead; there is no other way.
Each entry is stored inside a cache in a special serialized format, so the actual memory footprint of an entry depends on the format used. By default Ignite uses BinaryMarshaller to convert an object into a byte array, and this array is stored inside a BinaryObject.
The reason is simple: distributed computing systems continuously exchange entries between nodes, and every entry in a cache should be ready to be transferred as a byte array.
Please read the article; it was recently updated. You can estimate the entry overhead by hand for small entries, but for big entries you should inspect the actual entry stored in the cache as a byte array. Look at the withKeepBinary method.
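For example (a minimal sketch, assuming a running node ignite and a cache named "data", both made up here), withKeepBinary() returns the entry in the serialized BinaryObject form that actually sits in the cache, so you can inspect it without deserializing the whole 120 MB value:

IgniteCache<Integer, BinaryObject> binCache = ignite.cache("data").withKeepBinary();
BinaryObject value = binCache.get(1);             // returned as-is, no deserialization
System.out.println(value.type().typeName());      // inspect the stored binary type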
Related
I am working on a project involving huge objects in physical memory on Windows.
I wanted to create a really big data structure, but I ran into some problems.
When I try to allocate a huge amount of data, I can only create an object as large as the heap allows (which also depends on the architecture of the operating system).
I am not sure whether this is restricted by the thread's private heap or by something else.
When I looked into how the operating system places data in memory, I found out that the data is stored in a particular order.
And here come some questions...
If I want to create large objects, should I have one very large heap region to allocate memory from? If so, I would have to fragment the data.
Alternatively, I had the idea of finding the starting addresses of empty regions and then using that unused space to hold the data in some data structure.
If this idea is feasible, how could it be done?
Another question: do you think a list would be the best option for that sort of huge object? Or would another data structure be better?
Do you think the chosen data structure could be split across two separate regions of memory while still standing as one object?
Thanks in advance; any answer to my questions could be helpful.
There seems to be some kind of misconception about memory allocation here.
(1) Most operating systems do not allocate memory linearly. There usually are discontinuities in the memory mapped to a process address space.
(2) If you want to allocate a huge amount of memory, you should do it directly with the operating system; not through a heap.
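As a hedged illustration of point (2) in Java (the file name and size below are made up), a memory-mapped file gets its address space directly from the operating system rather than from the language heap:

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class OsBackedBuffer {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile f = new RandomAccessFile("huge.bin", "rw");
             FileChannel ch = f.getChannel()) {
            long size = 1L << 30;                                    // 1 GiB backing region
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, size);
            buf.putLong(0, 42L);                                     // write through the mapping
            System.out.println(buf.getLong(0));                      // read it back
        }
    }
}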
I am doing a school assignment to analyze heap usage in Hadoop. It involves running two versions of a MapReduce program that calculates the median length of forum comments: the first one is 'memory-unconscious' and its reducer keeps an in-memory list with the length of every comment; the second one is 'memory-conscious' and its reducer uses a very memory-efficient data structure to handle the data.
The purpose is to run both programs on data of different sizes and watch how the memory usage grows faster in the first one (until it eventually runs out of memory).
My question is: how can I obtain the heap usage of hadoop or the reduce tasks?
I thought the counter "Total committed heap usage (bytes)" would contain this data, but I have found that both versions of the program return almost the same values.
Regarding the correctness of the programs, the 'memory-unconscious' one runs out of memory with a large input and fails, while the other one does not and is able to finish.
Thanks in advance
I don't know which memory-conscious data structure you are using (telling us which one might help), but most such structures effectively spill data beyond a certain size out of main memory (e.g. to disk/virtual memory) according to some policy, so they don't produce an out-of-memory error, whereas the memory-unconscious version does not do that. In both cases the reported heap size stays roughly the same, which is why you are getting the same value. To get the real memory usage of the reducer you can do the following:
A feature added in Java 1.5 is the Instrumentation interface, which lets you obtain an object's memory usage (getObjectSize). Nice article about it: LINK
/* Returns the amount of free memory in the Java Virtual Machine. Calling the gc method may result in increasing the value returned by freeMemory. */
long freeMemory = Runtime.getRuntime().freeMemory();

/* Returns the maximum amount of memory that the Java virtual machine will attempt to use. If there is no inherent limit then the value Long.MAX_VALUE will be returned. */
long maximumMemory = Runtime.getRuntime().maxMemory();

/* Returns the total amount of memory in the Java virtual machine. The value returned by this method may vary over time, depending on the host environment.
Note that the amount of memory required to hold an object of any given type may be implementation-dependent. */
long totalMemory = Runtime.getRuntime().totalMemory();
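For instance, one way (just a sketch, not a Hadoop API) to turn those calls into a number you can compare between the two reducers is to log the heap actually in use at the end of each reduce() call:

long usedMemory = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
System.err.println("Heap in use: " + (usedMemory >> 20) + " MB");   // committed minus free = occupied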
I have a quick question:
How can we use 3 GB of free space to access roughly 30 GB of data without virtual memory or compression? It's more of a data structure question.
Thanks
You should somehow mimic the paging mechanism.
One way to do it is hashing (1).
Hash all your data into bins and store these bins on disk. In main memory (RAM) you hold only an array of pointers to disk. When you need an address, you find where it is on disk by going to RAM and taking the pointer stored at location hash(address).
You can of course optimize this by keeping a portion of the data in memory, relying on the principle of locality and hoping for hits that avoid reloading a chunk from disk.
(1) The hashing does not have to be complex or uniformly distributed. I believe using the MSBs of the address will be just fine, and will actually mimic the paging mechanism better.
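A rough sketch of that layout (all names and sizes are made up; 30 GB is rounded up to 32 GiB so the bins are power-of-two sized): RAM holds only an array of bin start offsets, and the most significant bits of a logical address select the bin, mimicking a page table as the footnote suggests.

import java.io.IOException;
import java.io.RandomAccessFile;

public class DiskBackedStore {
    private static final int  BIN_BITS  = 12;                  // 4096 bins
    private static final long BIN_SIZE  = 8L << 20;            // 8 MiB per bin, ~32 GiB total
    private static final int  BIN_SHIFT = 23;                  // log2(BIN_SIZE)

    private final long[] binOffset = new long[1 << BIN_BITS];  // the only thing kept in RAM
    private final RandomAccessFile data;                       // the ~30 GB live here on disk

    public DiskBackedStore(String path) throws IOException {
        data = new RandomAccessFile(path, "r");
        for (int i = 0; i < binOffset.length; i++) {
            binOffset[i] = (long) i * BIN_SIZE;                 // trivial layout: bins back to back
        }
    }

    // Fetch `length` bytes at a logical address: the top bits pick the bin (one
    // RAM lookup), the remaining bits are the offset inside that bin on disk.
    public byte[] read(long address, int length) throws IOException {
        int  bin   = (int) (address >>> BIN_SHIFT);
        long inBin = address & (BIN_SIZE - 1);
        byte[] buf = new byte[length];
        data.seek(binOffset[bin] + inBin);
        data.readFully(buf);
        return buf;
    }
}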
The most obvious way would be through a typical filesystem API with read, write, and seek functions.
I have a file with "holes" in it and want to fill them with data; I also need to be able to free "used" space and make free space.
I was thinking of using a bi-map that maps offset and length. However, I am not sure that is the best approach if there are really tiny gaps in the file. A bitmap would work, but I don't know how it could be switched dynamically for certain regions of space. Perhaps some sort of radix tree is the way to go?
For what it's worth, I am up to speed on modern file system design (ZFS, HFS+, NTFS, XFS, ext...) and I find their solutions woefully inadequate.
My goal is pretty good space savings (hence the concern about small fragments). If I didn't care about that, I would just go for two splay trees: one sorted by offset and the other sorted by length, with ties broken by offset. Note that this gives you amortized log(n) for all operations, with a working-set time of log(m)... pretty darn good. But, as previously mentioned, it does not handle high fragmentation well.
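For concreteness, a rough sketch of that two-tree fallback, using TreeMap/TreeSet in place of splay trees (so the bounds are worst-case rather than amortized log n; coalescing of adjacent extents is omitted):

import java.util.Comparator;
import java.util.TreeMap;
import java.util.TreeSet;

public class FreeExtents {
    private final TreeMap<Long, Long> byOffset = new TreeMap<>();   // offset -> length, for coalescing
    private final TreeSet<long[]> byLength = new TreeSet<>(         // [length, offset], ties by offset
            Comparator.<long[]>comparingLong(e -> e[0]).thenComparingLong(e -> e[1]));

    // Record a freed extent in both views.
    public void free(long offset, long length) {
        byOffset.put(offset, length);
        byLength.add(new long[]{length, offset});
    }

    // Best fit: smallest free extent that is at least `length` long.
    public long allocate(long length) {
        long[] fit = byLength.ceiling(new long[]{length, Long.MIN_VALUE});
        if (fit == null) return -1;                        // nothing big enough
        byLength.remove(fit);
        byOffset.remove(fit[1]);
        long leftover = fit[0] - length;
        if (leftover > 0) free(fit[1] + length, leftover); // give back the tail
        return fit[1];                                     // offset of the allocation
    }
}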
I have shipped commercial software that does just that. In the latest iteration, we ended up sorting blocks of the file into "type" and "index," so you could read or write "the third block of type foo." The file ended up being structured as:
1) File header. Points at master type list.
2) Data. Each block has a header with type, index, logical size, and padded size.
3) Arrays of (offset, size) tuples for each given type.
4) Array of (type, offset, count) that keeps track of the types.
We defined it so that each block was an atomic unit. You started writing a new block, and finished writing that before starting anything else. You could also "set" the contents of a block. Starting a new block always appended at the end of the file, so you could append as much as you wanted without fragmenting the block. "Setting" a block could re-use an empty block.
When you opened the file, we loaded all the indices into RAM. When you flushed or closed a file, we re-wrote each index that changed, at the end of the file, then re-wrote the index index at the end of the file, then updated the header at the front. This means that changes to the file were all atomic -- either you commit to the point where the header is updated, or you don't. (Some systems use two copies of the header 8 kB apart to preserve headers even if a disk sector goes bad; we didn't take it that far)
One of the block "types" was "free block." When re-writing changed indices, and when replacing the contents of a block, the old space on disk was merged into the free list kept in the array of free blocks. Adjacent free blocks were merged into a single bigger block. Free blocks were re-used when you "set content" or for updated type block indices, but not for the index index, which always was written last.
Because the indices were always kept in memory, working with an open file was really fast -- typically just a single read to get the data of a single block (or get a handle to a block for streaming). Opening and closing was a little more complex, as it needed to load and flush the indices. If it becomes a problem, we could load the secondary type index on demand rather than up-front to amortize that cost, but it never was a problem for us.
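A compressed sketch of that layout (field widths and type codes here are assumptions, not the original product's format): every block is appended at the end of the file behind a small fixed header, and the caller records the returned offset in the in-memory index for that type.

import java.io.IOException;
import java.io.RandomAccessFile;

public class BlockFile {
    private final RandomAccessFile file;

    public BlockFile(String path) throws IOException {
        this.file = new RandomAccessFile(path, "rw");
    }

    // Append one block and return the offset where it starts.
    public long appendBlock(int type, int index, byte[] payload) throws IOException {
        long offset = file.length();
        int padded = (payload.length + 7) & ~7;            // pad to an 8-byte boundary
        file.seek(offset);
        file.writeInt(type);                               // block header: type ...
        file.writeInt(index);                              // ... index within that type ...
        file.writeInt(payload.length);                     // ... logical size ...
        file.writeInt(padded);                             // ... and padded size
        file.write(payload);
        file.write(new byte[padded - payload.length]);     // padding
        return offset;
    }
}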
Top priority for persistent (on disk) storage: Robustness! Do not lose data even if the computer loses power while you're working with the file!
Second priority for on-disk storage: Do not do more I/O than necessary! Seeks are expensive. On Flash drives, each individual I/O is expensive, and writes are doubly so. Try to align and batch I/O. Using something like malloc() for on-disk storage is generally not great, because it does too many seeks. This is also a reason I don't like memory mapped files much -- people tend to treat them like RAM, and then the I/O pattern becomes very expensive.
For memory management I am a fan of the BiBOP* approach, which is normally efficient at managing fragmentation.
The idea is to segregate data based on their size. This way, within a "bag" you only have "pages" of small blocks with identical sizes:
no need to store the size explicitly, it's known depending on the bag you're in
no "real" fragmentation within a bag
The bag keeps a simple free-list of the available pages. Each page keeps a free-list of available storage units in an overlay over those units.
You need an index to map size to its corresponding bag.
You also need a special treatment for "out-of-norm" requests (ie requests that ask for allocation greater than the page size).
This storage is extremely space efficient, especially for small objects, because the overhead is not per-object; however, there is one drawback: you can end up with "almost empty" pages that still contain one or two occupied storage units.
This can be alleviated if you have the ability to "move" existing objects, which effectively allows you to merge pages.
(*) BiBOP: Big Bag Of Pages
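A condensed sketch of the idea (all names and sizes are made up): requests are rounded up to a size class, each size class ("bag") owns pages of identical-size slots, and each page keeps a free list of its slots, so no per-object header is needed.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class BibopAllocator {
    private static final int PAGE_SIZE = 4096;

    private static class Page {
        final long base;                            // offset of the page in backing storage
        final Deque<Long> freeSlots = new ArrayDeque<>();
        Page(long base, int slotSize) {
            this.base = base;
            for (long off = 0; off + slotSize <= PAGE_SIZE; off += slotSize) {
                freeSlots.push(base + off);         // every slot starts out free
            }
        }
    }

    private final Map<Integer, Deque<Page>> bags = new HashMap<>(); // size class -> its pages
    private long nextPageBase = 0;                                  // where the next page will start

    // Round a request up to a power-of-two size class (8, 16, 32, ...).
    private static int sizeClass(int size) {
        int c = 8;
        while (c < size) c <<= 1;
        return c;
    }

    // Returns the offset of a slot; naively only looks at the most recent page.
    public long allocate(int size) {
        int cls = sizeClass(size);
        Deque<Page> bag = bags.computeIfAbsent(cls, k -> new ArrayDeque<>());
        Page page = bag.peek();
        if (page == null || page.freeSlots.isEmpty()) {
            page = new Page(nextPageBase, cls);     // grab a fresh page for this bag
            nextPageBase += PAGE_SIZE;
            bag.push(page);
        }
        return page.freeSlots.pop();
    }

    // The caller passes the size to keep the sketch short; a real BiBOP derives
    // the size class from the offset alone.
    public void free(long offset, int size) {
        for (Page p : bags.get(sizeClass(size))) {
            if (offset >= p.base && offset < p.base + PAGE_SIZE) {
                p.freeSlots.push(offset);           // slot goes back on its page's free list
                return;
            }
        }
    }
}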
I would recommend making a customized file system (it might contain just one file, of course) based on FUSE. There are a lot of available FUSE solutions you can build on - I recommend choosing not the most elaborate but the simplest projects, in order to learn easily.
Which algorithm and data structure to choose depends highly on your needs. It could be: a map, a list, or a file split into chunks with on-the-fly compression/decompression.
The data structures you propose are good ideas. As you can clearly see, there is a trade-off: fragmentation vs. compaction.
On one side - best compaction, highest fragmentation - splay trees and many other kinds of trees.
On the other side - lowest fragmentation, worst compaction - a linked list.
In between there are B-trees and others.
As I understand it, you stated space saving as the priority, while still taking care about performance.
I would recommend a mixed data structure in order to achieve all the requirements:
a kind of list of contiguous blocks of data
a kind of tree for the current "add/remove" operations
when data is required on demand, allocate from the tree; when data is deleted, keep track of what's "deleted" using the tree as well
mixing -> during each operation (or in idle moments), do "step by step" defragmentation, applying the changes kept in the tree to the contiguous blocks while moving them slowly.
This solution gives you a fast response on demand, while "optimising" things as the file is used (for example, for each read of 10 MB of data -> defragmentation of 1 MB) or in idle moments.
The simplest solution is a free list: keep a linked list of free blocks, reusing the free space to store the address of the next block in the list.
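A minimal sketch of that free list (the file name is an assumption, and the head pointer is kept in memory here rather than in a file header): each free block stores, in its first 8 bytes, the offset of the next free block, so the list costs nothing beyond the space that is already free.

import java.io.IOException;
import java.io.RandomAccessFile;

public class FileFreeList {
    private static final long NIL = -1L;
    private final RandomAccessFile file;
    private long head = NIL;                         // offset of the first free block

    public FileFreeList(String path) throws IOException {
        this.file = new RandomAccessFile(path, "rw");
    }

    // Mark a block as free by pushing it onto the list.
    public void free(long offset) throws IOException {
        file.seek(offset);
        file.writeLong(head);                        // next pointer lives inside the free block
        head = offset;
    }

    // Reuse the most recently freed block, or report that none is available.
    public long allocate() throws IOException {
        if (head == NIL) return NIL;                 // caller must extend the file instead
        long offset = head;
        file.seek(offset);
        head = file.readLong();                      // pop: follow the stored next pointer
        return offset;
    }
}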
Several of the resources I've consulted on the internet disagree on how set associative caching works.
For example, hardware secrets seems to believe it works like this:
Then the main RAM memory is divided in the same number of blocks available in the memory cache. Keeping the 512 KB 4-way set associative example, the main RAM would be divided into 2,048 blocks, the same number of blocks available inside the memory cache. Each memory block is linked to a set of lines inside the cache, just like in the direct mapped cache.
http://www.hardwaresecrets.com/printpage/481/8
They seem to be saying that each cache block (4 cache lines) maps to a particular block of contiguous RAM. They are saying non-contiguous blocks of system memory (RAM) can't map to the same cache block.
This is their picture of how hardwaresecrets thinks it works:
http://www.hardwaresecrets.com/fullimage.php?image=7864
Contrast that with wikipedia's picture of set associative cache
http://upload.wikimedia.org/wikipedia/commons/9/93/Cache%2Cassociative-fill-both.png.
Brown disagrees with hardware secrets
Consider what might happen if each cache line had two sets of fields: two valid bits, two dirty bits, two tag fields, and two data fields. One set of fields could cache data for one area of main memory, and the other for another area which happens to map to the same cache line.
http://www.spsu.edu/cs/faculty/bbrown/web_lectures/cache/
That is, non-contiguous blocks of system memory can map to the same cache block.
How are the relationships between non-contiguous blocks of system memory and cache blocks created? I read somewhere that these relationships are based on cache strides, but I can't find any information on cache strides other than that they exist.
Who is right?
If striding is actually used, how does it work, and do I have the correct technical name? How do I find the stride for a particular system? Is it based on the paging system? Can someone point me to a URL that explains N-way set associative caches in great detail?
also see:
http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/Memory/set.html
When I teach cache memory architecture to my students, I start with a direct-mapped cache. Once that is understood, you can think of N-way set associative caches as parallel blocks of direct-mapped cache. To understand that both figures may be correct, you need to first understand the purpose of set-assoc caches.
They are designed to work around the problem of 'aliasing' in a direct-mapped cache, where multiple memory locations can map to a specific cache entry. This is illustrated in the Wikipedia figure. So, instead of evicting a cache entry, we can use a N-way cache to store the other 'aliased' memory locations.
In effect, the hardware secrets diagram would be correct assuming the order of replacement is such that the first chunk of main memory is mapped to Way 1, the second chunk to Way 2, and so on and so forth. However, it is equally possible to have the first chunk of main memory spread over multiple ways.
Hope this explanation helps!
PS: Contiguous memory locations are only needed for a single cache line, exploiting spatial locality. As for the latter part of your question, I believe that you may be confusing several different concepts.
The replacement policy decides where in the cache a copy of a particular entry of main memory will go. If the replacement policy is free to choose any entry in the cache to hold the copy, the cache is called fully associative. At the other extreme, if each entry in main memory can go in just one place in the cache, the cache is direct mapped. Many caches implement a compromise in which each entry in main memory can go to any one of N places in the cache, and are described as N-way set associative.
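To make the mapping concrete, here is a small sketch (the geometry only follows the 512 KB, 4-way, 64-byte-line example as an assumption): the set index comes from the middle bits of the address, so two addresses 128 KB apart land in the same set but carry different tags and can occupy two of its four ways at the same time.

public class SetAssocDemo {
    static final int LINE_SIZE = 64;      // bytes per cache line
    static final int NUM_SETS  = 2048;    // sets in a 512 KB, 4-way cache
    static final int WAYS      = 4;       // lines per set

    static int setIndex(long address) {
        return (int) ((address / LINE_SIZE) % NUM_SETS);     // middle bits of the address
    }

    static long tag(long address) {
        return address / LINE_SIZE / NUM_SETS;               // remaining upper bits
    }

    public static void main(String[] args) {
        long a = 0x10000L;                                    // two addresses far apart in RAM...
        long b = a + (long) LINE_SIZE * NUM_SETS;             // ...exactly one "stride" (128 KB) later
        System.out.println(setIndex(a) == setIndex(b));       // true: they compete for the same set
        System.out.println(tag(a) + " vs " + tag(b));         // different tags, so both can be cached
    }
}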