The following excerpt from The Definitive Guide provides high-level details as shown below, but what exactly is virtual memory referring to in this task counter?
How to interpret it? How is it related to PHYSICAL_MEMORY_BYTES?
The following is an example extract from one of the jobs: physical memory is approximately 214 GB and virtual memory is approximately 611 GB.
1. What exactly is virtual memory referring to in this task counter?
Virtual memory here is used to prevent out-of-memory errors for a task when the data does not fit in RAM (physical memory). The portion that does not fit in RAM is held in virtual memory (swap space on disk) instead.
So, while setting up a Hadoop cluster, one is advised to set vm.swappiness = 1 to achieve better performance. On Linux systems, vm.swappiness is set to 60 by default; the higher the value, the more aggressively the kernel swaps memory pages.
https://community.hortonworks.com/articles/33522/swappiness-setting-recommendation.html
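A minimal sketch of applying that recommendation on a Linux node (assuming root access and a standard sysctl setup; adjust paths for your distribution):

    # Check the current value (60 on many distributions)
    cat /proc/sys/vm/swappiness

    # Apply the recommended value to the running kernel
    sudo sysctl -w vm.swappiness=1

    # Persist the setting across reboots
    echo "vm.swappiness = 1" | sudo tee -a /etc/sysctl.conf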
2. How to interpret it? How is it related to PHYSICAL_MEMORY_BYTES?
Memory pages are swapped from physical memory to virtual memory on disk when there is not enough physical memory. That is the relationship between PHYSICAL_MEMORY_BYTES and VIRTUAL_MEMORY_BYTES.
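To see the same distinction for a single process outside of the Hadoop counters, you can inspect procfs (the PID below is hypothetical; VmRSS is resident, i.e. physical, memory and VmSize is the total virtual address space):

    # 12345 is a hypothetical PID of a running task JVM
    grep -E 'VmSize|VmRSS' /proc/12345/status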
I'm getting this memory error on my reducers:
Container is running beyond physical memory limits. Current usage: 6.1 GB of 6 GB physical memory used; 10.8 GB of 30 GB virtual memory used.
So yeah, there's a physical memory issue that can be resolved by increasing mapreduce.reduce.memory.mb, BUT I don't understand why it happens.
The more data goes into the pipe, the more likely this memory issue is to occur. The thing is that most of my reducers (about 90%) pass, and memory was supposed to be freed as reducers finish, because the data should already have been written to disk.
What am I missing here?
In YARN, containers are pre-allocated with memory by default. In this case every reducer container gets 6 GB of memory, regardless of how much free memory there is across the cluster. The JVMs of these containers are not re-used or shared across containers (at least not in Hadoop 2), which means that if one reducer goes beyond its 6 GB limit, it will not grab resources from other, idle containers (if that was your concern).
Now, the fact that only a few reducers keep going above their memory limit (given that 90% passed) hints at a possible skew in the data, which means those reducers might be processing more input groups or more keys than the others.
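If raising the limit is the chosen workaround, a per-job override could look roughly like the sketch below (the jar name, driver class, paths, and the 8 GB container / ~6.5 GB heap figures are illustrative assumptions; the driver must go through ToolRunner for the -D options to be picked up):

    # Hypothetical job submission with a larger reducer container and JVM heap
    hadoop jar my-job.jar com.example.MyDriver \
      -D mapreduce.reduce.memory.mb=8192 \
      -D mapreduce.reduce.java.opts=-Xmx6553m \
      /input /output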
I've searched about this but I don't seem to get a fair answer.
Let's say I want to create a VM that has a vCPU, and that vCPU must have 10 cores, but I only have 2 computers with 5 physical CPU cores each.
Is it possible to create one vCPU by relying on these two physical CPUs so that it performs like a regular single physical CPU?
Update 1: let's say I'm using VirtualBox, and the term vCPU refers to virtual CPU, which is a well-known term.
Update 2: I'm asking this because I'm doing a little research about dynamic provisioning in HPC clusters, and I want to know if the word "dynamic" really means allocating virtual CPUs dynamically from different hardware, like bare-metal servers. I don't know if I was searching in the wrong place, but no one really answers this question in the docs.
Unfortunately, I have to start by saying that I completely disagree with the answer from OSGX (and I have to start with that, as the rest of my answer depends on it). There are documented cases where aggregating the CPU power of multiple physical systems into a single system image works great. Even regarding the comment about ScaleMP, ...solutions can be ranged from "make target application slower" to "make target application very-very slow"..., all one needs to do to invalidate that claim is to check the top-rated machines in the SPEC CPU benchmark lists and see that machines using ScaleMP are among the top 5 SMPs ever built for performance on this benchmark.
Also, from a computer architecture perspective, all large-scale machines are essentially a collection of smaller machines with a special fabric (Xbar, NUMAlink, etc.) and some logic/chipset to manage cache coherence. Today's standard fabrics (PCIe switching, InfiniBand) are just as fast, if not faster, than those proprietary SMP interconnects. Will OSGX claim those SMPs are also "very-very-slow"?
The real question, as with any technology, is what you are trying to achieve. Most technologies are a good fit for one task but not another. If you are trying to build a large machine (say, combine 16 servers, each with 24 cores, into a 384-core SMP), on top of which you will be running small VMs, each using a single-digit number of vCPUs, then this kind of SSI solution would probably work very nicely, because to the underlying infrastructure you are merely running a high-throughput computing (HTC) job, just like SPEC CPU is. However, if you are running thread-parallel software that makes heavy use of serializing elements (barriers, locks, etc.) requiring intensive communication between all cores, then maybe you won't see any benefit.
As to the original question on the thread, or rather, the "Update 2" by the author:
...I'm asking this because i'm doing a little research about dynamic provisioning in HPC clusters...
Indeed, there is not a lot of technology out there that enables the creation of a single system from CPUs across a cluster. The technology mentioned earlier, from ScaleMP, does this, but only at physical-server granularity: if you have a cluster of 100 servers and each cluster node has 24 cores, then you can "dynamically" create virtual machines of 48 cores (2 cluster nodes), 72 cores (3 cluster nodes), and so on, but you could not create a machine with 36 cores (1.5 cluster nodes), nor combine a few vacant CPUs from across different nodes. You either use all the cores from a node to combine into a virtual SMP, or none at all.
I'll use the term vCPU for virtual cores and pCPU for physical cores, as defined by the VirtualBox documentation: https://www.virtualbox.org/manual/ch03.html#settings-processor
On the "Processor" tab, you can set how many virtual CPU cores the guest operating systems should see. Starting with version 3.0, VirtualBox supports symmetrical multiprocessing (SMP) and can present up to 32 virtual CPU cores to each virtual machine. You should not, however, configure virtual machines to use more CPU cores than you have available physically (real cores, no hyperthreads).
And I will try to answer your questions:
Let's say I want to create a VM that has a vCPU, and that vCPU must have 10 cores, but I only have 2 computers with 5 physical CPU cores each.
If you want to create a virtual machine (with a single OS image, i.e. an SMP machine), all virtual cores must have shared memory. Two physical machines with 5 cores each have 10 cores in total, but they have no shared memory. So, with classic virtualization software (QEMU, KVM, Xen, VMware, VirtualBox, Virtual PC), you are not able to combine two physical machines into a single virtual machine.
Is it possible to create that vCPU by relying on these two physical CPUs so that it performs like a regular single physical CPU?
No.
A regular physical machine has one or more CPU chips (sockets), and each chip has one or more cores. The first PCs had 1 chip with one core; there were servers with two sockets and one core in each. Later, multicore chips were made, and huge servers may have 2, 4, 6 or sometimes even 8 sockets, with some number of cores per socket. A physical machine also has RAM, the dynamic memory used to store data. Earlier multisocket systems had a single memory controller; current multisocket systems have several memory controllers (MCs, 1-2 per socket, each controller with 1, 2, or sometimes 3 or 4 channels of memory). Both multicore and multisocket systems allow any CPU core to access any memory, even if it is controlled by the MC of another socket. And all accesses to system memory are coherent (memory coherence, cache coherence): any core may write to memory, and any other core will see the writes from the first core in some defined order (according to the consistency model of the system). This is shared memory.
"two physical" chips of two different machines (your PC and your laptop) have not connected their RAM together and don't implement in hardware any model of memory sharing and coherency. Two different computers interacts using networks (Ethernet, Wifi, .. which just sends packets) or files (store file on USB drive, disconnect from PC, connect to laptop, get the file). Both network and file sharing are not coherent and are not shared memory
I'm using VirtualBox
With VirtualBox (and some other virtualization solutions) you can allocate 8 virtual cores for a virtual machine even when your physical machine has only 4 cores. The VMM will simply emulate 8 cores, scheduling them one after another on the available physical cores, so at any given moment only programs from 4 virtual cores are actually running on physical cores (https://forums.virtualbox.org/viewtopic.php?f=1&t=30404 "core i7, this is a 4 core .. I can use up to 16 VCPU on virtual Machine .. Yes, it means your host cores will be over-committed. .. The total load of all guest VCPUs will be split among the real CPUs."). In this case you will be able to start a 10-core virtual machine on a 5-core physical machine, and an application that wants 10 cores will get them. But the performance of the application will be no better than with 5 real CPUs; it will be worse, because there will be "virtual CPU switching", and frequent synchronization will add extra overhead.
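As a small illustration of that over-commit (the VM name "testvm" is hypothetical), VirtualBox will happily accept more vCPUs than the host has physical cores:

    # Assign 10 virtual cores to a hypothetical VM, even on a 5-core host
    VBoxManage modifyvm "testvm" --cpus 10
    VBoxManage startvm "testvm" --type headless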
Update 2: I'm asking this because I'm doing a little research about dynamic provisioning
If you want to research "dynamic provisioning", ask about that, not about "running something unknown on two PCs at the same time".
in HPC clusters,
There is no single type of "HPC" or "HPC cluster". Different variants of HPC require different solutions and implementations. Some HPC tasks need huge amounts of memory (0.25, 0.5, 1, 2 TB) and will run only on shared-memory 4- or 8-socket machines filled with the largest memory DIMM modules. Other HPC tasks may make heavy use of GPGPUs. A third kind combines thread parallelism (OpenMP) and process parallelism (MPI), so applications use shared memory among the threads running on a single machine, and send and receive packets over the network to work collectively on one task while running on several (up to thousands of) physical machines. A fourth kind of HPC may want 100 or 1000 TB of shared memory; there are no SMP / NUMA machines with such amounts, so the application can be written in a distributed shared memory paradigm/model (distributed global address space, DGAS, or partitioned global address space, PGAS) to run on special machines or on huge clusters. Special solutions are used, and in PGAS the global shared memory of hundreds of TB is emulated from many computers connected by a network. The program is written in a special language or simply uses special library functions to access memory (a list of variants from Wikipedia's PGAS article: "Unified Parallel C, Coarray Fortran, Split-C, Fortress, Chapel, X10, UPC++, Global Arrays, DASH and SHMEM"). If the address or the request is in local memory, it is used directly; if it is in the memory of another machine, a packet is sent to that machine to request the data. Even with the fastest (100 Gbit/s) special networks with RDMA capability (the network adapter may access the memory of the PC without any additional software processing of the incoming network packet), the difference between local memory and the memory of a remote computer is speed: you have higher access latency and lower bandwidth when the memory is remote (remote memory is slower than local memory).
If you say "vCPU must have 10 cores" we can read this as "there is application which want 10 core of shared memory system". In theory it is possible to emulate shared memory for application (and it can be possible to create virtualization solution which will use resources from several PC to create single virtual pc with more resources), but in practice this is very complex task and the result probably will has too low performance. There is commercial ScaleMP (very high cost; Wikipedia: ScaleMP "The ScaleMP hypervisor combines x86 servers to create a virtual symmetric multiprocessing system. The process is a type of hardware virtualization called virtualization for aggregation.") and there was commercial Cluster OpenMP from Intel (https://software.intel.com/sites/default/files/1b/1f/6330, https://www.hpcwire.com/2006/05/19/openmp_on_clusters-1/) to convert OpenMP programs (uses threads and shared memory) into MPI-like software with help of library and OS-based handlers of access to remote memory. Both solutions can be ranged from "make target application slower" to "make target application very-very slow" (internet search of scalemp+slow and cluster+openmp+slow), as computer network is always slower that computer memory (network has greater distance than memory - 100m vs 0.2m, network has narrow bus of 2, 4 or 8 high-speed pairs while memory has 64-72 high-speed pairs for every memory channel; network adapter will use external bus of CPU when memory is on internal interface, most data from network must be copied to the memory to become available to CPU).
and I want to know if the word "dynamic" really means
no one really answers this question in the docs.
If you want help from other people, show us the context or the docs you have for the task. It may also be useful for you to better understand some basic concepts of computing and cluster computing (did you take any CS/HPC courses?).
There are some results from internet searches like "dynamic+provisioning+in+HPC+clusters", but we can't say whether they cover the same HPC variant you want or not.
We currently have an ArangoDB cluster running version 3.0.10 for a POC, with about 81 GB stored on disk and main memory consumption of about 98 GB distributed across 5 primary DB servers. There are about 200 million vertices and 350 million edges, in 3 edge collections and 3 document collections; most of the memory (80%) is consumed by the edges.
I'm exploring methods to decrease the main memory consumption. I'm wondering if there are any methods to compress/serialize the data so that less main memory is used.
The reason for decreasing memory is to reduce infrastructure costs, I'm willing to trade-off on speed for my use case.
Please can you let me know if there are any methods to reduce main memory consumption for ArangoDB.
It took us a while to find out that our original recommendation to set vm.overcommit_memory to 2 is not good in all situations.
It seems that there is an issue with the jemalloc memory allocator bundled with ArangoDB in some environments.
With a vm.overcommit_memory kernel setting of 2, the allocator had a problem splitting existing memory mappings, which made the number of memory mappings of an arangod process grow over time. This could have led to the kernel refusing to hand out more memory to the arangod process, even if physical memory was still available. The kernel will only grant up to vm.max_map_count memory mappings to each process, which defaults to 65530 on many Linux environments.
Another issue when running jemalloc with vm.overcommit_memory set to 2 is that for some workloads the amount of memory that the Linux kernel tracks as "committed memory" also grows over time and does not decrease. So eventually the ArangoDB daemon process (arangod) may not get any more memory simply because it reaches the configured overcommit limit (physical RAM * overcommit_ratio + swap space).
So the solution here is to modify the value of vm.overcommit_memory from 2 to either 1 or 0. This will fix both of these problems.
We are still observing ever-increasing virtual memory consumption when using jemalloc with any overcommit setting, but in practice this should not cause problems.
So adjusting the value of vm.overcommit_memory from 2 to either 0 or 1 (0 is the Linux kernel default, by the way) should improve the situation.
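A minimal sketch of how that could be applied on a Linux host (assuming root access and a standard sysctl setup):

    # Inspect the current overcommit policy
    cat /proc/sys/vm/overcommit_memory

    # Change it from 2 to the kernel default of 0 (or to 1) at runtime
    sudo sysctl -w vm.overcommit_memory=0

    # Persist the setting across reboots
    echo "vm.overcommit_memory = 0" | sudo tee -a /etc/sysctl.conf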
Another way to address the problem, which however requires compiling ArangoDB from source, is to build without jemalloc (-DUSE_JEMALLOC=Off when running cmake). I am just listing this as an alternative here for completeness. With the system's libc allocator you should see quite stable memory usage. We also tried another allocator, namely the one from libmusl, and it also shows quite stable memory usage over time. The main thing that makes exchanging the allocator a non-trivial issue is that jemalloc otherwise has very nice performance characteristics.
(quoting Jan Steemann as can be found on github)
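If you go down the no-jemalloc route mentioned above, the build might be configured roughly like this (an illustrative out-of-source build; any other cmake options your environment needs are omitted):

    # Configure an ArangoDB source checkout without the bundled jemalloc
    mkdir build && cd build
    cmake -DUSE_JEMALLOC=Off ..
    make -j4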
Several new additions to the RocksDB storage engine have been made in the meantime. We demonstrate how memory management works in RocksDB.
Many options of the RocksDB storage engine are exposed to the outside via startup options.
Discussions and research by a user led to further options being exposed for configuration with ArangoDB 3.7:
--rocksdb.cache-index-and-filter-blocks-with-high-priority
--rocksdb.pin-l0-filter-and-index-blocks-in-cache
--rocksdb.pin-top-level-index-and-filter
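A sketch of how these might be passed when starting the server (the boolean values shown are assumptions to illustrate the syntax, not tuning advice; check the ArangoDB 3.7 documentation for the defaults):

    # Illustrative arangod startup with the three RocksDB cache/pinning options
    arangod \
      --rocksdb.cache-index-and-filter-blocks-with-high-priority true \
      --rocksdb.pin-l0-filter-and-index-blocks-in-cache true \
      --rocksdb.pin-top-level-index-and-filter true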
It might be a trivial question, but I feel like I need to ask it:
When Heroku says that I have 512 MB of RAM and 10 process types, does this mean I have 512 MB of RAM per process, or is the 512 MB divided by the number of processes I use, e.g. 512 MB / 10 = 51.2 MB per process?
If it's the latter, doesn't that make the unlimited number of processes in Heroku useless? I don't understand this.
Each dyno is an independent container running on a different instance. You can think of each one as a separate server.
That means each running process gets its own memory and CPU. The 512 MB is therefore not divided by the number of processes.