What does virtual core in YARN vcore mean?

YARN uses the concept of a virtual core (vcore) to manage CPU resources. What is the benefit of using virtual cores, and is there a particular reason YARN uses vcores?

Here is what the documentation states (emphasis mine):
A node's capacity should be configured with virtual cores equal to its
number of physical cores. A container should be requested with the
number of cores it can saturate, i.e. the average number of threads it
expects to have runnable at a time.
Unless a CPU core is hyper-threaded, it can run only one thread at a time. (With hyper-threading the OS sees two logical cores for each physical core and can run two threads on it, but that is a bit of a cheat and nowhere near as efficient as having a second physical core.) For the end user this means that a core runs a single thread, so if you want parallelism using Java threads, a reasonably good approximation is a number of threads equal to the number of cores. So if your container process (which is a JVM) will need 2 threads, it is better to map it to 2 vcores; that is what the last sentence of the quote means. And the total vcore capacity of a node should equal its number of physical cores.
The most important thing to remember is that it is still the OS that schedules the threads onto the different cores, as it does for any other application. YARN itself has no control over that; all it provides is the best possible approximation of how many threads to allocate to each container. That is why it is important to take into account other applications running on the OS, CPU cycles used by the kernel, and so on, since not all cores will be available to YARN applications all the time.
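The sizing rule from the quoted documentation can be sketched as simple arithmetic. This is only an illustration, not YARN code: the function names and the round-up choice are assumptions made for the example, and it counts CPU only (memory is ignored).

```python
import math


def container_vcores(avg_runnable_threads):
    """vcores to request: average number of runnable threads, rounded up, at least 1."""
    return max(1, math.ceil(avg_runnable_threads))


def containers_per_node(node_vcores, vcores_per_container):
    """How many such containers fit on one node when only CPU is considered."""
    return node_vcores // vcores_per_container


# A node with 16 physical cores is configured with 16 vcores.
request = container_vcores(2)             # a JVM that keeps about 2 threads busy
print(containers_per_node(16, request))   # 8
```

Whether those 8 containers really get two cores each is still up to the OS scheduler, as described above.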
EDIT: Further research
YARN does not impose hard limits on CPU, but going through the code I can see how it tries to influence CPU scheduling or the CPU rate. Technically YARN can launch different kinds of container processes: Java, Python, custom shell commands, etc. Launching containers is the responsibility of the ContainerExecutor component of the NodeManager, and I can see the code for launching the container along with some hints (depending on the platform). For example, DefaultContainerExecutor (which extends ContainerExecutor) uses the "-c" parameter for CPU restriction on Windows, and process niceness to influence it on Linux. There is another implementation, LinuxContainerExecutor (or better still CgroupsLCEResourcesHandler, as the former does not force the usage of cgroups), which tries to use Linux cgroups to limit the CPU resources YARN uses on that node. More details can be found here.
ContainerExecutor {
.......
.......
  protected String[] getRunCommand(String command, String groupId,
      String userName, Path pidFile, Configuration conf, Resource resource) {
    boolean containerSchedPriorityIsSet = false;
    int containerSchedPriorityAdjustment =
        YarnConfiguration.DEFAULT_NM_CONTAINER_EXECUTOR_SCHED_PRIORITY;
    if (conf.get(YarnConfiguration.NM_CONTAINER_EXECUTOR_SCHED_PRIORITY) !=
        null) {
      containerSchedPriorityIsSet = true;
      containerSchedPriorityAdjustment = conf
          .getInt(YarnConfiguration.NM_CONTAINER_EXECUTOR_SCHED_PRIORITY,
          YarnConfiguration.DEFAULT_NM_CONTAINER_EXECUTOR_SCHED_PRIORITY);
    }
    if (Shell.WINDOWS) {
      int cpuRate = -1;
      int memory = -1;
      if (resource != null) {
        if (conf
            .getBoolean(
                YarnConfiguration.NM_WINDOWS_CONTAINER_MEMORY_LIMIT_ENABLED,
                YarnConfiguration.DEFAULT_NM_WINDOWS_CONTAINER_MEMORY_LIMIT_ENABLED)) {
          memory = resource.getMemory();
        }
        if (conf.getBoolean(
            YarnConfiguration.NM_WINDOWS_CONTAINER_CPU_LIMIT_ENABLED,
            YarnConfiguration.DEFAULT_NM_WINDOWS_CONTAINER_CPU_LIMIT_ENABLED)) {
          int containerVCores = resource.getVirtualCores();
          int nodeVCores = conf.getInt(YarnConfiguration.NM_VCORES,
              YarnConfiguration.DEFAULT_NM_VCORES);
          // cap overall usage to the number of cores allocated to YARN
          int nodeCpuPercentage = Math
              .min(
                  conf.getInt(
                      YarnConfiguration.NM_RESOURCE_PERCENTAGE_PHYSICAL_CPU_LIMIT,
                      YarnConfiguration.DEFAULT_NM_RESOURCE_PERCENTAGE_PHYSICAL_CPU_LIMIT),
                  100);
          nodeCpuPercentage = Math.max(0, nodeCpuPercentage);
          if (nodeCpuPercentage == 0) {
            String message = "Illegal value for "
                + YarnConfiguration.NM_RESOURCE_PERCENTAGE_PHYSICAL_CPU_LIMIT
                + ". Value cannot be less than or equal to 0.";
            throw new IllegalArgumentException(message);
          }
          float yarnVCores = (nodeCpuPercentage * nodeVCores) / 100.0f;
          // CPU should be set to a percentage * 100, e.g. 20% cpu rate limit
          // should be set as 20 * 100. The following setting is equal to:
          // 100 * (100 * (vcores / Total # of cores allocated to YARN))
          cpuRate = Math.min(10000,
              (int) ((containerVCores * 10000) / yarnVCores));
        }
      }
      return new String[] { Shell.WINUTILS, "task", "create", "-m",
          String.valueOf(memory), "-c", String.valueOf(cpuRate), groupId,
          "cmd /c " + command };
    } else {
      List<String> retCommand = new ArrayList<String>();
      if (containerSchedPriorityIsSet) {
        retCommand.addAll(Arrays.asList("nice", "-n",
            Integer.toString(containerSchedPriorityAdjustment)));
      }
      retCommand.addAll(Arrays.asList("bash", command));
      return retCommand.toArray(new String[retCommand.size()]);
    }
  }
}
On Windows (where it uses winutils.exe), it passes a CPU rate.
On Linux, it uses niceness as the parameter that controls CPU priority.
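To make the two branches easier to follow, here is a small Python restatement of the arithmetic and command construction in the Java excerpt. It is a sketch for illustration only: the function names are mine, `launch_container.sh` is just a placeholder command, and Python floats are used where the Java code uses a 32-bit float.

```python
def windows_cpu_rate(container_vcores, node_vcores, node_cpu_percentage=100):
    """Mirror the cpuRate arithmetic: a percentage times 100, capped at 10000."""
    pct = max(0, min(node_cpu_percentage, 100))
    if pct == 0:
        raise ValueError("CPU percentage limit must be greater than 0")
    yarn_vcores = (pct * node_vcores) / 100.0   # cores' worth of CPU given to YARN
    return min(10000, int((container_vcores * 10000) / yarn_vcores))


def linux_run_command(command, nice_adjustment=None):
    """Mirror the non-Windows branch: prefix `nice -n N` only if a priority is configured."""
    prefix = ["nice", "-n", str(nice_adjustment)] if nice_adjustment is not None else []
    return prefix + ["bash", command]


print(windows_cpu_rate(2, 8))       # 2500, i.e. 25% of the CPU available to YARN
print(windows_cpu_rate(2, 8, 50))   # 5000, because YARN only owns half of the node's CPU
print(linux_run_command("launch_container.sh", 5))
```

Note that the Linux branch only adjusts scheduling priority; it does not cap CPU usage the way the Windows rate does.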

"Virtual cores" are merely an abstraction of actual cores. This abstraction, or "lie" (as I like to call it), allows YARN (and others) to dynamically spin up threads (parallel processes) based on availability. Take, for example, running MapReduce on an "elastic" cluster whose processing limit is constrained only by your wallet... The cloud, baby... The. Cloud.
You can read more here.

Related

Memory loads experience different latency on the same core

I am trying to implement a cache-based covert channel in C but noticed something weird. The sender and the receiver share a physical address by calling mmap() on the same file with the MAP_SHARED option. Below is the code for the sender process, which flushes an address from the cache to transmit a 1 and loads an address into the cache to transmit a 0. It also measures the latency of a load in both cases:
// computes latency of a load operation
static inline CYCLES load_latency(volatile void* p) {
    CYCLES t1 = rdtscp();
    load = *((int *)p);
    CYCLES t2 = rdtscp();
    return (t2-t1);
}
void send_bit(int one, void *addr) {
    if(one) {
        clflush((void *)addr);
        load__latency = load_latency((void *)addr);
        printf("load latency = %d.\n", load__latency);
        clflush((void *)addr);
    }
    else {
        x = *((int *)addr);
        load__latency = load_latency((void *)addr);
        printf("load latency = %d.\n", load__latency);
    }
}
int main(int argc, char **argv) {
    if(argc == 2)
    {
        bit = atoi(argv[1]);
    }
    // transmit bit
    init_address(DEFAULT_FILE_NAME);
    send_bit(bit, address);
    return 0;
}
The load operation takes around 0 - 1000 cycles (during a cache-hit and a cache-miss) when issued by the same process.
The receiver program loads the same shared physical address and measures the latency on a cache hit or a cache miss; its code is shown below:
int main(int argc, char **argv) {
    init_address(DEFAULT_FILE_NAME);
    rdtscp();
    load__latency = load_latency((void *)address);
    printf("load latency = %d\n", load__latency);
    return 0;
}
(I ran the receiver manually after the sender process terminated.)
However, the latency observed in this scenario is very different from the first case. The load operation takes around 5000-1000 cycles.
Both processes are pinned to the same core ID using the taskset command. So, if I'm not wrong, both processes should see L1-cache latency on a cache hit and DRAM latency on a cache miss. Yet these two processes experience very different latencies. What could be the reason for this observation, and how can I make both processes experience the same latency?

The initial access to an mmapped region will page-fault (lazy mapping/allocation by the kernel), unless you use mmap(MAP_POPULATE), or mlock, or touch some other cache line of the page first.
You're probably timing a page fault if you only do one time measurement per mmap, or per run of a whole program.
(Also, you don't seem to be doing anything to warm up the CPU frequency, so one core cycle could be many reference cycles. Some of the time for an L3 miss is fixed in terms of memory clock cycles, but another part of it scales with core/uncore clock.)
Also note that unless you run the 2nd process immediately (e.g. from the same shell command), the OS will get a chance to put that core into a deep sleep. On Intel CPUs at least, that empties L1d and L2 so it can power them down in the deeper C states. Probably also the TLBs.
It's also strange that you cast away volatile in load = *((int *)p);
Assigning the load result to a global(?) variable inside the timed region is also pretty questionable; that could also soft page fault. If so, RDTSCP will have to wait for it, because the store can't retire.
(But on a TLB hit for the store, it doesn't have to wait for it to commit to cache, since there's no mfence before rdtscp. A store can retire while the store is still in the store buffer. In fact it must retire before the store buffer entry is known to be non-speculative so it can commit.)
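The first-touch effect described above can be observed, very roughly, even from Python. The sketch below maps a scratch file with MAP_SHARED and times two consecutive reads of the same byte. Everything here is an assumption for illustration: it is Unix-only, the helper name is made up, and interpreter overhead is far larger than a cache miss, so it can only hint at the page-fault cost, not at cache-level latencies.

```python
import mmap
import os
import tempfile
import time


def touch_twice_ns(path, length=mmap.PAGESIZE):
    """Map `path` with MAP_SHARED and time two consecutive reads of the same byte."""
    fd = os.open(path, os.O_RDWR)
    try:
        with mmap.mmap(fd, length, flags=mmap.MAP_SHARED) as mm:
            t0 = time.perf_counter_ns()
            _ = mm[0]            # first touch: may include the page fault
            t1 = time.perf_counter_ns()
            _ = mm[0]            # second touch: the page is already mapped
            t2 = time.perf_counter_ns()
    finally:
        os.close(fd)
    return t1 - t0, t2 - t1


with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * mmap.PAGESIZE)   # backing file, one page long
try:
    first_ns, second_ns = touch_twice_ns(f.name)
    print(f"first touch: {first_ns} ns, second touch: {second_ns} ns")
finally:
    os.remove(f.name)
```

In the C programs, the equivalent fix would be to touch the page (or use MAP_POPULATE) before the timed load and to take many measurements per run rather than one.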

Kafka Stream BoundedMemoryRocksDBConfig

I'm trying to understand how the internals of Kafka Streams work with respect to the cache and RocksDB (the state store).
KTable<Windowed<EligibilityKey>, String> kTable = kStreamMapValues
    .groupByKey(Grouped.with(keySpecificAvroSerde, Serdes.String())).windowedBy(timeWindows)
    .reduce((a, b) -> b, materialized.withLoggingDisabled().withRetention(Duration.ofSeconds(retention)))
    .suppress(Suppressed.untilTimeLimit(Duration.ofSeconds(timeToWaitForMoreEvents),
        Suppressed.BufferConfig.unbounded().withLoggingDisabled()));
With the above portion of my topology, I'm consuming from a Kafka topic with 300 partitions. The application is deployed on OpenShift with a memory allocation of 4GB. I noticed the memory of the application constantly increasing until eventually an OOMKILLED occurs. After some research I've read that a custom RocksDB config was something I should implement, because the default size is too big for my application. Records first enter a cache (configured by CACHE_MAX_BYTES_BUFFERING_CONFIG and COMMIT_INTERVAL_MS_CONFIG) and then enter a state store.
public class BoundedMemoryRocksDBConfig implements RocksDBConfigSetter {
    private static org.rocksdb.Cache cache = new org.rocksdb.LRUCache(1 * 1024 * 1024L, -1, false, 0);
    private static org.rocksdb.WriteBufferManager writeBufferManager = new org.rocksdb.WriteBufferManager(1 * 1024 * 1024L, cache);
    @Override
    public void setConfig(final String storeName, final Options options, final Map<String, Object> configs) {
        BlockBasedTableConfig tableConfig = (BlockBasedTableConfig) options.tableFormatConfig();
        // These three options in combination will limit the memory used by RocksDB to the size passed to the block cache (TOTAL_OFF_HEAP_MEMORY)
        tableConfig.setBlockCache(cache);
        tableConfig.setCacheIndexAndFilterBlocks(true);
        options.setWriteBufferManager(writeBufferManager);
        // These options are recommended to be set when bounding the total memory
        tableConfig.setCacheIndexAndFilterBlocksWithHighPriority(true);
        tableConfig.setPinTopLevelIndexAndFilter(true);
        tableConfig.setBlockSize(2048L);
        options.setMaxWriteBufferNumber(2);
        options.setWriteBufferSize(1 * 1024 * 1024L);
        options.setTableFormatConfig(tableConfig);
    }
    @Override
    public void close(final String storeName, final Options options) {
        // Cache and WriteBufferManager should not be closed here, as the same objects are shared by every store instance.
    }
}
For each time-windowed store, three segments are created by default. Since I'm consuming from 300 partitions and 3 segments are created for each, 900 instances of RocksDB are created. Is my understanding correct that the following is true?
Memory allocated in OpenShift / RocksDB instances => 4096MB / 900 => 4.55 MB
(WriteBufferSize * MaxWriteBufferNumber) + BlockCache + WriteBufferManager => (1MB * 2) + 1MB + 1MB => 4MB
Does BoundedMemoryRocksDBConfig.java apply to each instance of RocksDB separately, or to all of them together?

If you consume from a topic with 300 partitions and you use a segmented state store, i.e., you use a time window in the DSL, you will end up with 900 RocksDB instances. If you only use one Kafka Streams client, i.e., you do not scale out, all 900 RocksDB instances will end up on the same computing node.
The BoundedMemoryRocksDBConfig limits the memory RocksDB uses per Kafka Streams client. That means, if you only use one Kafka Streams client, BoundedMemoryRocksDBConfig limits the memory for all 900 instances.
Is my understanding correct that the following is true?
Memory allocated in OpenShift / RocksDB instances => 4096MB / 900 => 4.55 MB
(WriteBufferSize * MaxWriteBufferNumber) + BlockCache + WriteBufferManager => (1MB * 2) + 1MB + 1MB => 4MB
No, that is not correct.
If you pass the Cache to WriteBufferManager, the size needed for memtables is also counted against the cache (see footnote 1 in the docs of the BoundedMemoryRocksDBConfig and the RocksDB docs). So, the size that you pass to the cache is the limit for memtables and block cache. Since you pass the cache and the write buffer manager to all of your instances on the same computing node, all 900 instances are bounded by the size you pass to the cache. For example, if you specify a size of 4 GB, the total memory used by all 900 instances (assuming one Kafka Streams client) is bounded to 4 GB.
Be aware that the size passed to the cache is not a strict limit. Although the boolean parameter in the constructor of the cache gives you the option to enforce a strict limit, the enforcement does not work if the write buffer memory is also counted against the cache due to a bug in the RocksDB version Kafka Streams uses.
With Kafka Streams 2.7.0, you will have the possibility to monitor RocksDB memory consumption with metrics that are exposed via JMX. Have a look at KIP-607 for more details.
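As a worked version of the point above, the following sketch redoes the arithmetic for the configuration in the question, where the shared LRUCache is 1 MiB. The function names are made up for the example, and the "average share" is only a way to visualise the numbers: the bound applies to the total across all instances, not to each instance, and as noted it is not a strict limit.

```python
MIB = 1024 * 1024


def rocksdb_instances(partitions, segments_per_store=3):
    """One RocksDB instance per segment per partition."""
    return partitions * segments_per_store


def average_share_bytes(shared_cache_bytes, instances):
    """Average slice of the shared cache per instance (the bound is on the total)."""
    return shared_cache_bytes // instances


instances = rocksdb_instances(300)   # 900
total_bound = 1 * MIB                # LRUCache size in the question's config
print(instances, total_bound, average_share_bytes(total_bound, instances))
```

With the question's settings, memtables and block cache for all 900 instances are together accounted against that single 1 MiB cache, rather than 4 MB per instance.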

Customize task resources on Airflow using MesosExecutor

Is it possible to specify resources (CPU, memory, GPU, disk space) for each operator of a DAG when using MesosExecutor?
I know you can specify global values for resources of a task.
For instance, I have several operators that are CPU-intensive and others that are not. I would like to run the CPU-intensive ones one at a time, but many of the lightweight ones in parallel.

From the code (mesos_executor.py line 67), it seems that this is not possible, since the cpu and memory values are passed to the scheduler during initialization:
def __init__(self,
             task_queue,
             result_queue,
             task_cpu=1,
             task_mem=256):
    self.task_queue = task_queue
    self.result_queue = result_queue
    self.task_cpu = task_cpu
    self.task_mem = task_mem
and those values are used without modification:
cpus = task.resources.add()
cpus.name = "cpus"
cpus.type = mesos_pb2.Value.SCALAR
cpus.scalar.value = self.task_cpu
mem = task.resources.add()
mem.name = "mem"
mem.type = mesos_pb2.Value.SCALAR
mem.scalar.value = self.task_mem
Achieving that requires a custom Executor implementation.
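One way such a custom executor could choose resources per task is a lookup with a fallback to the global defaults. The sketch below shows only that lookup; `TaskResources`, `resolve_resources`, and the task ids are hypothetical names invented for this example and are not part of Airflow or Mesos.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResources:
    cpus: float = 1   # same defaults as task_cpu / task_mem in the excerpt
    mem: int = 256


def resolve_resources(task_id, overrides, default=TaskResources()):
    """Return the per-task resources if configured, else the global defaults."""
    return overrides.get(task_id, default)


# Hypothetical per-task overrides a custom executor might read from configuration.
overrides = {"train_model": TaskResources(cpus=8, mem=4096)}
print(resolve_resources("train_model", overrides))
print(resolve_resources("send_email", overrides))
```

A custom scheduler would then set `cpus.scalar.value` and `mem.scalar.value` from the resolved values instead of from `self.task_cpu` and `self.task_mem`.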

why does the linux kernel thread hog up cpu

I have created a kernel thread using kthread_run in a kernel module.
The thread is very simple, as shown below.
static int my_thread_func(void * data)
{
    int a;
    DBG_PRINT("policy:%lu; prio:%d", current->policy, current->prio);
    while (!kthread_should_stop())
    {
        a++;
    }
}
However, after I loaded the module, the system stopped responding.
So I wondered what the scheduling policy and priority of this kernel thread are.
I then printed the scheduling policy and priority of the thread and got the output below.
policy:0; prio:120
policy:0 means SCHED_NORMAL;
prio:120 is also not high.
Since the thread does not have a SCHED_FIFO or SCHED_RR scheduling policy, why can it hog the CPU?
I also found that if I insert a sleep into the loop body of the thread, the system remains responsive.
I also found that when I run a userspace program implemented as below, the system remains responsive, too.
int main(int argc, char *argv[])
{
    int a;
    while (1) a++;
    return 0;
}
So can anyone tell me why the kernel thread can hog the CPU?

When you say the priority is 120, are you observing the priority of kthreadd, or the actual kernel thread that was created for you?
Please see http://lxr.free-electrons.com/source/kernel/kthread.c#L310 for the function that is used to create new kernel threads.
Excerpt:
struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
                                           void *data, int node,
                                           const char namefmt[],
                                           ...)
{
    ...
    if (!IS_ERR(task)) {
        static const struct sched_param param = { .sched_priority = 0 };
        ...
        /*
         * root may have changed our (kthreadd's) priority or CPU mask.
         * The kernel thread should not inherit these properties.
         */
        sched_setscheduler_nocheck(task, SCHED_NORMAL, &param);
        set_cpus_allowed_ptr(task, cpu_all_mask);
    }
    ...
}
It appears that the sched_priority is set to 0.
Now, please look at http://lxr.free-electrons.com/source/include/linux/sched/prio.h#L9
Excerpt:
/*
 * Priority of a process goes from 0..MAX_PRIO-1, valid RT
 * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
 * tasks are in the range MAX_RT_PRIO..MAX_PRIO-1. Priority
 * values are inverted: lower p->prio value means higher priority.
 *
 * The MAX_USER_RT_PRIO value allows the actual maximum
 * RT priority to be separate from the value exported to
 * user-space. This allows kernel threads to set their
 * priority to a value higher than any user task. Note:
 * MAX_RT_PRIO must not be smaller than MAX_USER_RT_PRIO.
 */
Please note the second paragraph: This allows kernel threads to set their priority to a value higher than any user task.
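The ranges in that comment can be made concrete with a little arithmetic, which also shows where the prio:120 value from the question falls. The constants below mirror my understanding of include/linux/sched/prio.h (MAX_RT_PRIO = 100, nice values from -20 to 19); treat them as assumptions, since they are not quoted in the excerpt and may differ between kernel versions.

```python
MAX_RT_PRIO = 100                                # RT priorities occupy 0..99
NICE_WIDTH = 40                                  # nice values -20..19
MAX_PRIO = MAX_RT_PRIO + NICE_WIDTH              # 140
DEFAULT_PRIO = MAX_RT_PRIO + NICE_WIDTH // 2     # 120


def nice_to_prio(nice):
    """Map a nice value to p->prio for a SCHED_NORMAL/SCHED_BATCH task."""
    if not -20 <= nice <= 19:
        raise ValueError("nice must be in -20..19")
    return DEFAULT_PRIO + nice


def is_rt_prio(prio):
    """True if prio lies in the real-time range 0..MAX_RT_PRIO-1."""
    return 0 <= prio < MAX_RT_PRIO


print(nice_to_prio(0))    # 120
print(is_rt_prio(120))    # False
```

Remember that the values are inverted: a lower p->prio means a higher priority.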
Summary:
It appears that newly created kernel threads are set to sched_priority of 0 even with SCHED_NORMAL. Priority of 0 is not a normal priority for SCHED_NORMAL, and thus the kernel thread will take priority over any other thread (that isn't using an RT policy---and I don't think any processes in the kernel use RT by default).
Addendum:
Note: I am not 100% sure that this is the reason. But if you look at the comments in the kernel about kernel threads, they all seem to imply that a kernel thread keeps running UNLESS:
The kernel thread itself yields or calls do_exit.
Someone else calls kthread_stop(), which makes kthread_should_stop() return true.
To me, that sounds like the kernel thread runs for as long as it wants, until it decides to stop or someone else explicitly tells it to stop.

Akka actors and Clustering-I'm having trouble with ClusterSingletonManager- unhandled event in state Start

I've got a system that uses Akka 2.2.4 which creates a bunch of local actors and sets them as the routees of a Broadcast Router. Each worker handles some segment of the total work, according to some hash range we pass it. It works great.
Now, I've got to cluster this application for failover. Based on the requirement that only one worker per hash range exist/be triggered on the cluster, it seems to me that setting up each one as a ClusterSingletonManager would make sense. However, I'm having trouble getting it working. The actor system starts up, it creates the ClusterSingletonManager, and it adds the path in the code cited below to a Broadcast Router, but for some reason it never instantiates my actual worker actor to handle my messages. All I get is a log message: "unhandled event ${my message} in state Start". What am I doing wrong? Is there something else I need to do to start up this single-instance cluster? Am I sending the message to the wrong actor?
Here's my Akka config (I use the default config as a fallback):
akka{
  cluster{
    roles=["workerSystem"]
    min-nr-of-members = 1
    role {
      workerSystem.min-nr-of-members = 1
    }
  }
  daemonic = true
  remote {
    enabled-transports = ["akka.remote.netty.tcp"]
    netty.tcp {
      hostname = "127.0.0.1"
      port = ${akkaPort}
    }
  }
  actor{
    provider = akka.cluster.ClusterActorRefProvider
    single-message-bound-mailbox {
      # FQCN of the MailboxType. The Class of the FQCN must have a public
      # constructor with
      # (akka.actor.ActorSystem.Settings, com.typesafe.config.Config) parameters.
      mailbox-type = "akka.dispatch.BoundedMailbox"
      # If the mailbox is bounded then it uses this setting to determine its
      # capacity. The provided value must be positive.
      # NOTICE:
      # Up to version 2.1 the mailbox type was determined based on this setting;
      # this is no longer the case, the type must explicitly be a bounded mailbox.
      mailbox-capacity = 1
      # If the mailbox is bounded then this is the timeout for enqueueing
      # in case the mailbox is full. Negative values signify infinite
      # timeout, which should be avoided as it bears the risk of dead-lock.
      mailbox-push-timeout-time = 1
    }
    worker-dispatcher{
      type = PinnedDispatcher
      executor = "thread-pool-executor"
      # Throughput defines the number of messages that are processed in a batch
      # before the thread is returned to the pool. Set to 1 for as fair as possible.
      throughput = 500
      thread-pool-executor {
        # Keep alive time for threads
        keep-alive-time = 60s
        # Min number of threads to cap factor-based core number to
        core-pool-size-min = ${workerCount}
        # The core pool size factor is used to determine thread pool core size
        # using the following formula: ceil(available processors * factor).
        # Resulting size is then bounded by the core-pool-size-min and
        # core-pool-size-max values.
        core-pool-size-factor = 3.0
        # Max number of threads to cap factor-based number to
        core-pool-size-max = 64
        # Minimum number of threads to cap factor-based max number to
        # (if using a bounded task queue)
        max-pool-size-min = ${workerCount}
        # Max no of threads (if using a bounded task queue) is determined by
        # calculating: ceil(available processors * factor)
        max-pool-size-factor = 3.0
        # Max number of threads to cap factor-based max number to
        # (if using a bounded task queue)
        max-pool-size-max = 64
        # Specifies the bounded capacity of the task queue (< 1 == unbounded)
        task-queue-size = -1
        # Specifies which type of task queue will be used, can be "array" or
        # "linked" (default)
        task-queue-type = "linked"
        # Allow core threads to time out
        allow-core-timeout = on
      }
      fork-join-executor {
        # Min number of threads to cap factor-based parallelism number to
        parallelism-min = 1
        # The parallelism factor is used to determine thread pool size using the
        # following formula: ceil(available processors * factor). Resulting size
        # is then bounded by the parallelism-min and parallelism-max values.
        parallelism-factor = 3.0
        # Max number of threads to cap factor-based parallelism number to
        parallelism-max = 1
      }
    }
  }
}
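The sizing rule spelled out in the config comments above, ceil(available processors * factor) bounded by the min and max settings, can be checked with a few lines of arithmetic. This is a sketch of the formula as the comments state it, not Akka code; the function name is made up, and the minimum of 8 stands in for ${workerCount}, whose real value is not given.

```python
import math


def bounded_pool_size(available_processors, factor, size_min, size_max):
    """ceil(available processors * factor), then clamped to [size_min, size_max]."""
    scaled = math.ceil(available_processors * factor)
    return min(max(scaled, size_min), size_max)


# factor = 3.0 and max = 64 as in the config; min = 8 is an assumed workerCount.
print(bounded_pool_size(4, 3.0, 8, 64))    # 12
print(bounded_pool_size(32, 3.0, 8, 64))   # 64
print(bounded_pool_size(1, 3.0, 8, 64))    # 8
```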
Here's where I create my actors (it's written in Groovy):
Props clusteredProps = ClusterSingletonManager.defaultProps("worker".toString(), PoisonPill.getInstance(), "workerSystem",
    new ClusterSingletonPropsFactory(){
        @Override
        Props create(Object handOverData) {
            log.info("called in ClusterSingetonManager")
            Props.create(WorkerActorCreator.create(applicationContext, it.start, it.end)).withDispatcher("akka.actor.worker-dispatcher").withMailbox("akka.actor.single-message-bound-mailbox")
        }
    } )
ActorRef manager = system.actorOf(clusteredProps, "worker-${it.start}-${it.end}".toString())
String path = manager.path().child("worker").toString()
path
When I try to send a message to the actual worker actor, should the path above resolve? Currently it does not.
What am I doing wrong? Also, these actors live within a Spring application, and the worker actors are set up with some @Autowired dependencies. While this Spring integration worked well in a non-clustered environment, are there any gotchas in a clustered environment I should be looking out for?
Thank you.
FYI: I've also posted this in the akka-user Google group. Here's the link.

The path in your code is to the ClusterSingletonManager actor that you start on each node with role "workerSystem". It will create a child actor (WorkerActor) with name "worker-${it.start}-${it.end}" on the oldest node in the cluster, i.e. singleton within the cluster.
You should also define the name of the ClusterSingletonManager, e.g. system.actorOf(clusteredProps, "workerSingletonManager").
You can't send the messages to the ClusterSingletonManager. You must send them to the path of the active worker, i.e. including the address of the oldest node. That is illustrated by the ConsumerProxy in the documentation.
I'm not sure you should use a singleton at all for this. All workers will be running on the same node, the oldest. I would prefer to discuss alternative solutions to your problem at the akka-user google group.
