I now have a Ray cluster working on EC2 (Ubuntu 16.04) with a c4.8xlarge master node and one identical worker. I wanted to check whether multi-threading was being used, so I ran tests to time increasing numbers (n) of the same 9-second task. Since the instance has 18 CPUs, I expected to see the job taking about 9s for up to n<=35 (assuming one CPU for the cluster management) and then either a fault, or an increase to about 18 sec when switching to 36 vCPUs per node.
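The expectation described above can be written down as a small model. This is only a sketch of the ideal case: it assumes tasks run in full "waves" of one task per free CPU, and the 35-slot figure (two 18-CPU nodes minus one CPU for cluster management) is the assumption stated in the question, not a measured value. The function name `expected_wall_time` is made up for illustration.

```python
import math

def expected_wall_time(n_tasks, slots, task_seconds=9.0):
    """Ideal wall time when tasks run in waves of `slots` tasks at a time."""
    if n_tasks <= 0:
        return 0.0
    waves = math.ceil(n_tasks / slots)  # a partial last wave still costs a full task
    return waves * task_seconds

# Two nodes x 18 CPUs, minus one CPU assumed to be reserved for cluster management.
SLOTS = 2 * 18 - 1

for n in (1, 14, 35, 36, 70, 71):
    print(n, expected_wall_time(n, SLOTS))
```

Any measured curve that leaves this staircase well before n reaches the slot count suggests that fewer CPUs are available than assumed.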
Instead, the cluster handled only up to 14 tasks in parallel; beyond that the execution time jumped to 40s and continued to increase with increasing n. When I tried a c4.xlarge master (4 CPUs), the times were directly proportional to n, i.e. the tasks were running serially. So I surmise that the master actually requires 4 CPUs for the system, and that the worker node is not being used at all. However, if I add a second worker, the times for n>14 are about 40s less than without it. I also tried a value for target_utilization_factor less than 1.0, but that made no difference.
There were no reported errors, but I did notice that the ray-node-status for the worker in the EC2 Instances console was "update-failed". Is this significant? Can anyone enlighten me about this behaviour?
The cluster did not appear to be using the workers, so the trace shows only the 18 actual CPUs of the master dealing with the tasks. The monitor log (ray exec ray_conf.yaml 'tail -n 100 -f /tmp/ray/session_/logs/monitor') showed that the "update-failed" status is significant: the setup commands, called by the ray updater.py, were failing on the worker nodes. Specifically, it was the attempt to install the C build-essential compiler package on them that, presumably, exceeded the worker memory allocation. I was only doing this in order to suppress a "setproctitle" installation warning, which I now understand can be safely ignored anyway.
I'm migrating an e2e test stack from docker compose based setup to kubernetes. As part of this migration, I'm also creating terraform modules for individual services that make up a product.
A single e2e stack - ATM - is composed of ~50 pods and starts up in about 5 minutes (I run dedicated DBs, in-memory data stores, ESB integration tools, external mocked services, etc. per stack, hence the high number of pods).
During testing I would like to start up as many of those complete stacks as possible.
Currently I have a k8s cluster with 9 nodes:
six 64GB RAM, 512GB SSD, with latest gen i5 CPUs (max-pods with default 110) and
three 256GB RAM, 1TB SSD, with 18 core Xeon CPU (max-pods set to 330 on each) nodes.
I'm using my terraform modules to start up the stacks (the modules mostly define kubernetes resources).
My expectation would be that I can fire up ~30 stacks in parallel without major hiccups. I also would expect that startup times follow - to a reasonable extent - a function like st = ((#s + #n) % #n) * ss-st, where:
st denotes the overall startup time,
#s denotes number of stacks,
#n denotes number of nodes and finally
ss-st denotes single stack startup time.
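To make the expectation concrete, here is a minimal sketch that evaluates the formula exactly as written, reading `%` as the modulo operator. The second function is a hypothetical alternative reading (one stack per node per wave, i.e. a ceiling division); that reading is an assumption about the intent, not something the post states. The 300-second figure comes from the "about 5 minutes" mentioned earlier.

```python
import math

def startup_as_written(stacks, nodes, single_stack_s):
    """st = ((#s + #n) % #n) * ss-st, evaluated literally with % as modulo."""
    return ((stacks + nodes) % nodes) * single_stack_s

def startup_in_waves(stacks, nodes, single_stack_s):
    """Hypothetical alternative: stacks start in waves of one per node."""
    return math.ceil(stacks / nodes) * single_stack_s

SINGLE_STACK_S = 300  # roughly 5 minutes per stack, from the post
NODES = 9

for s in (1, 9, 10, 18, 30):
    print(s,
          startup_as_written(s, NODES, SINGLE_STACK_S),
          startup_in_waves(s, NODES, SINGLE_STACK_S))
```

Read literally, the formula yields 0 whenever the number of stacks is a multiple of the number of nodes, because `(s + n) % n` equals `s % n`; that is why the ceiling-based variant is shown alongside it.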
However, reality is very different. The blue columns show actual startup times (measured in seconds), while the red columns show my idealized expectation.
I installed the Prometheus operator and I have some metrics, but none of them explain to me (yet) where the bottleneck is in this case (disk utilization seems to get maxed out on the master nodes from time to time, but that alone does not seem to explain the end figures).
What am I doing wrong?
I read 8 billion lines from GCS, do some processing on each line, then write the output. My processing step can take a little time, so to avoid worker leases expiring and getting the error below, I do a GroupByKey on the 8 billion lines, grouping by id, to prevent fusion.
A work item was attempted 4 times without success. Each time the
worker eventually lost contact with the service. The work item was
attempted on:
The problem is that the GroupByKey step is taking forever to complete for 8 billion lines, even on 1000 high-mem-2 nodes.
I looked into whether the slow processing could be caused by a large number of values being generated per key by the GroupByKey. I don't think that's possible, because out of 8 billion inputs, one input id cannot appear in that set more than 30 times. So hot keys are clearly not the problem here; something else is going on.
Any ideas on how to optimize this are appreciated. Thanks.
I did manage to solve this problem. I had made a number of incorrect assumptions about Dataflow wall times. I was looking at my pipeline and assumed that the step with the highest wall time, which was in days, was the bottleneck. But in Apache Beam a step is usually fused together with the steps downstream of it in the pipeline, and it will only run as fast as those downstream steps run. So a significant wall time is not enough to conclude that a step is the bottleneck in the pipeline. The real solution to the problem stated above came from this thread: I reduced the number of nodes my pipeline runs on and changed the node type from high-mem-2 to high-mem-4. I wish there was an easy way to get memory usage metrics for a Dataflow pipeline; I had to ssh into the VMs and run jmap.
I am running a Spark job on a 6.6 GB input file (HDFS) with the master set to local. My Spark job with 53 partitions completes more quickly with local[6] than with local[2]; however, each individual task takes more computation time when more cores are assigned. For example, with 1 core (local[1]) each task takes 3 seconds, whereas with 6 cores (local[6]) this goes up to 12 seconds. Where is the time being wasted? The Spark UI shows an increase in computation time for each task in the local[6] case, and I can't understand why the same code takes a different computation time when more cores are assigned.
Update:
I see more %iowait in the iostat output with local[6] than with local[1]. Please let me know whether this is the only reason or whether there are other possible reasons. I also wonder why this iowait is not reported in the Spark UI; what I see there is an increase in computing time rather than in iowait time.
I am assuming you are referring to spark.task.cpus and not spark.cores.max.
With spark.task.cpus each task gets assigned more cores, but it doesn't necessarily have to use them. If your process is single-threaded it really can't use them. You wind up with additional overhead without additional benefit, and those cores are taken away from other single-threaded tasks that could use them.
With spark.cores.max it is simply an overhead issue with transferring data around at the same time.
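A rough back-of-the-envelope model shows why the job in the question can finish sooner with local[6] even though each task is slower, and why raising spark.task.cpus for single-threaded tasks hurts. This is a sketch under simplifying assumptions: every task takes the same time, tasks run in full waves, and the number of concurrent tasks is the core count divided by the cores reserved per task. The helper name `job_time` is made up for illustration; the 53 partitions and the 3 s and 12 s task times are taken from the question.

```python
import math

def job_time(num_tasks, cores, task_seconds, task_cpus=1):
    """Rough job time: tasks run in waves of (cores // task_cpus) at once."""
    slots = max(1, cores // task_cpus)  # concurrent tasks the cores can host
    waves = math.ceil(num_tasks / slots)
    return waves * task_seconds

PARTITIONS = 53  # one task per partition, as in the question

print(job_time(PARTITIONS, cores=1, task_seconds=3))    # local[1]
print(job_time(PARTITIONS, cores=6, task_seconds=12))   # local[6]
# Reserving 2 cores per single-threaded task halves the slots for no gain.
print(job_time(PARTITIONS, cores=6, task_seconds=12, task_cpus=2))
```

Under these assumptions the three cases come out at 159, 108 and 216 seconds, so the slower per-task time is more than offset by parallelism until cores are reserved that the tasks cannot use.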
I have a driver program that runs a set of 5 experiments - basically the driver program just tells the program which dataset to use (of which there are 5 and they're very similar).
The first iteration takes 3.5 minutes, the second 6 minutes, the third 30 minutes and the fourth has been running for over 30 minutes.
After each run the SparkContext object is stopped, it is then re-started for the next run - I thought this method would prevent slow down, as when sc.stop is called I was under the impression that the instances were cleared of all their RDD data - this is at least how it works in local mode. The dataset is quite small and according to Spark UI only 20Mb of data on 2 nodes is used.
Does sc.stop not remove all data from a node? What would cause such a slow down?
Call sc.stop after all iterations are complete. Whenever we stop the SparkContext and create a new one, it requires time to load the Spark configuration and jars and to free the driver port before the next job can execute.
and
Using the --executor-memory option you can speed up the process, depending on how much memory you have in each node.
Stupidly, I had used T2 instances. Their burstable performance means they only work on full power for a small amount of time. Read the documentation thoroughly - lesson learnt!
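The effect of burstable performance can be estimated with simple arithmetic. One CPU credit corresponds to one vCPU running at 100% for one minute, so an instance running flat out burns credits faster than it earns them and is throttled once the balance reaches zero. The sketch below is an illustration only: the function name is made up, and the balance and earn rate in the example call are illustrative values that should be replaced with the figures AWS documents for the instance size actually used.

```python
def minutes_until_throttled(credit_balance, credits_earned_per_hour,
                            vcpus=1, utilization=1.0):
    """Minutes of sustained load before the CPU credit balance is exhausted."""
    burn_per_min = vcpus * utilization          # credits spent per minute
    earn_per_min = credits_earned_per_hour / 60.0
    net = burn_per_min - earn_per_min
    if net <= 0:
        return float("inf")                     # load is at or below the baseline
    return credit_balance / net

# Illustrative values only: 30 credits in the bank, 6 credits earned per hour.
print(round(minutes_until_throttled(30, 6), 1))
```

A run that fits inside that window executes at full speed, while later runs on the same instances start with a depleted balance, which is consistent with each iteration being slower than the last.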
How do you configure AWS autoscaling to scale up quickly? I've set up an AWS autoscaling group with an ELB. All is working well, except it takes several minutes before the new instances are added and are online. I came across the following in a post about Puppet and autoscaling:
The time to scale can be lowered from several minutes to a few seconds if the AMI you use for a group of nodes is already up to date.
http://puppetlabs.com/blog/rapid-scaling-with-auto-generated-amis-using-puppet/
Is this true? Can time to scale be reduced to a few seconds? Would using puppet add any performance boosts?
I also read that smaller instances start quicker than larger ones:
Small Instance 1.7 GB of memory, 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit), 160 GB of instance storage, 32-bit platform with a base install of CentOS 5.3 AMI
Amount of time from launch of instance to availability:
Between 5 and 6 minutes us-east-1c
Large Instance 7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each), 850 GB of instance storage, 64-bit platform with a base install of CentOS 5.3 AMI
Amount of time from launch of instance to availability:
Between 11 and 18 minutes us-east-1c
Both were started via command line using Amazons tools.
http://www.philchen.com/2009/04/21/how-long-does-it-take-to-launch-an-amazon-ec2-instance
I note that the article is old and my c1.xlarge instances are certainly not taking 18min to launch. Nonetheless, would configuring an autoscale group with 50 micro instances (with an up scale policy of 100% capacity increase) be more efficient than one with 20 large instances? Or potentially creating two autoscale groups, one of micros for quick launch time and one of large instances to add CPU grunt a few minutes later? All else being equal, how much quicker does a t1.micro come online than a c1.xlarge?
You can increase or decrease the reaction time of an autoscaler by adjusting the
"--cooldown" value (in seconds).
Regarding the instance types to use, this mostly depends on the application type, and a decision should be taken after close performance monitoring and production tuning.
The time to scale can be lowered from several minutes to a few seconds
if the AMI you use for a group of nodes is already up to date. This
way, when Puppet runs on boot, it has to do very little, if anything,
to configure the instance with the node’s assigned role.
The advice here is talking about having your AMI (The snapshot of your operating system) as up to date as possible. This way, when auto scale brings up a new machine, Puppet doesn't have to install lots of software like it normally would on a blank AMI, it may just need to pull some updated application files.
Depending on how much work your Puppet scripts do (apt-get install, compiling software, etc) this could save you 5-20 minutes.
The two other factors you have to worry about are:
How long it takes your load balancer to determine you need more resources (e.g. a policy that dictates "new machines should be added when CPU is above 90% for more than 5 minutes" would be less responsive and more likely to lead to timeouts compared to "new machines should be added when CPU is above 60% for more than 1 minute")
How long it takes to provision a new EC2 instance (smaller instance types tend to take less time to provision)
How soon ASG responds would depend on 3 things:
1. Step - how much to increase by, as a % or a fixed number. With a large step you can increase capacity rapidly, since the ASG will launch the entire step in one go.
2. Cooldown period - this determines how soon the next increase can happen. If the previous increase is still within the defined cooldown period (in seconds), the ASG will wait and not take action for the next increase yet. A small cooldown period allows the next step to happen sooner.
3. AMI type - how much time an AMI takes to launch depends on the type of AMI, and many factors come into play. All things being equal, fully baked AMIs launch much faster.
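The interaction of the three factors above can be sketched as a simplified model. It assumes each scaling action adds a fixed percentage of the current capacity, the next action fires as soon as the cooldown ends, and every batch needs the same launch time to come into service. The function name and all numbers are hypothetical; real ASG behaviour also depends on alarm evaluation periods and health checks.

```python
import math

def time_to_scale(current, target, step_percent, cooldown_s, launch_s):
    """Return (scaling actions, seconds) needed to reach `target` instances."""
    if current <= 0:
        raise ValueError("current capacity must be positive")
    actions = 0
    elapsed = 0
    while current < target:
        if actions > 0:
            elapsed += cooldown_s               # wait out the cooldown between steps
        current += max(1, math.ceil(current * step_percent / 100))
        actions += 1
    if actions:
        elapsed += launch_s                     # last batch still has to boot
    return actions, elapsed

# Hypothetical group: 20 instances, doubling per step, aiming for 100.
print(time_to_scale(20, 100, step_percent=100, cooldown_s=300, launch_s=120))
print(time_to_scale(20, 100, step_percent=100, cooldown_s=60, launch_s=120))
print(time_to_scale(20, 100, step_percent=100, cooldown_s=300, launch_s=30))
```

In this model, shortening the cooldown helps more than a faster-booting AMI once several steps are needed, while a larger step reduces the number of cooldowns that have to be waited out at all.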