Ray does not distribute over all available CPUs within the cluster - amazon-ec2

The problem I have is that Ray will not distribute work across my workers.
I have 16 cores in total: two AWS EC2 Ubuntu instances with 8 CPUs each.
However, when I launch my Ray cluster and submit my Python script, it only distributes over 8 cores; only 8 PIDs show up as being utilized.
It is also worth noting that I am unable to access the Ray dashboard on the EC2 instance, so I only have this information from printing the PIDs being used.
How do I get my script to run on all 16 CPUs and therefore show 16 PIDs being used to execute the script?
This is my script:
import os
import ray
import time
import xgboost
from xgboost.sklearn import XGBClassifier


def printer():
    print("INSIDE WORKER " + str(time.time()) + " PID : " + str(os.getpid()))


# decorators allow for futures to be created for parallelization
@ray.remote
def func_1():
    # model = XGBClassifier()
    count = 0
    for i in range(100000000):
        count += 1
    printer()
    return count


@ray.remote
def func_2():
    # model = XGBClassifier()
    count = 0
    for i in range(100000000):
        count += 1
    printer()
    return count


@ray.remote
def func_3():
    count = 0
    for i in range(100000000):
        count += 1
    printer()
    return count


def main():
    # model = XGBClassifier()
    start = time.time()
    results = []
    ray.init(address='auto')
    # append function futures
    for i in range(10):
        results.append(func_1.remote())
        results.append(func_2.remote())
        results.append(func_3.remote())
    # run in parallel and get aggregated list
    a = ray.get(results)
    b = 0
    # add all values in list together
    for j in range(len(a)):
        b += a[j]
    print(b)
    # time to complete
    end = time.time()
    print(end - start)


if __name__ == '__main__':
    main()
This is my config:
# A unique identifier for the head node and workers of this cluster.
cluster_name: basic-ray-123454

# The maximum number of worker nodes to launch in addition to the head
# node. This takes precedence over min_workers. min_workers defaults to 0.
max_workers: 2
min_workers: 2

# Cloud-provider specific configuration.
provider:
    type: aws
    region: eu-west-2
    availability_zone: eu-west-2a
    file_mounts_sync_continuously: False

auth:
    ssh_user: ubuntu
    ssh_private_key: /home/user/.ssh/aws_ubuntu_test.pem

head_node:
    InstanceType: c5.2xlarge
    ImageId: ami-xxxxxxa6b31fd2c
    KeyName: aws_ubuntu_test
    BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
              VolumeSize: 200

worker_nodes:
    InstanceType: c5.2xlarge
    ImageId: ami-xxxxx26a6b31fd2c
    KeyName: aws_ubuntu_test

file_mounts: {
    "/home/ubuntu": "/home/user/RAY_AWS_DOCKER/ray_example_2_4/conda_env.yaml"
}

setup_commands:
    - echo "start initialization_commands"
    - sudo apt-get update
    - sudo apt-get upgrade
    - sudo apt-get install -y python-setuptools
    - sudo apt-get install -y build-essential curl unzip psmisc
    - pip install --upgrade pip
    - pip install ray[all]
    - echo "all files :"
    - ls
    # - conda install -c conda-forge xgboost

head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
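A quick way to narrow this down (not from the original post, just a suggested diagnostic using standard Ray APIs) is to print what resources the cluster actually registers after connecting. If the worker node never joined the head, ray.cluster_resources() will report only 8 CPUs; if both nodes joined, it reports 16, and since each @ray.remote task requests 1 CPU by default, the 30 queued tasks should then spread across both machines. A minimal sketch:

import ray

ray.init(address="auto")

# Total resources registered across the whole cluster (head node + workers).
# If this shows only {'CPU': 8.0, ...}, the worker node never joined.
print(ray.cluster_resources())

# Per-node view: one entry per node that has joined, with its IP,
# its resources, and whether it is currently alive.
for node in ray.nodes():
    print(node["NodeManagerAddress"], node["Resources"], "alive:", node["Alive"])

If only the head node shows up here, the problem is likely in worker_start_ray_commands or networking (ports between the instances) rather than in the script itself.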

Related

Strapi on Plesk Windows, port redirect error

I want to set up Strapi (v4) on a Plesk (Windows) / nodeJS (16.16) / MySQL (MariaDb 10.6.9) environment but I get the error: options.port should be >= 0 and < 65536. Received NaN.
Strapi Version: 4
Operating System: Windows 2022 Datacenter + Plesk Obsidian
Database: MariaDB 10.6.9
Node Version: 16.16
NPM Version: 8.19.2
Yarn Version: N/A
I did the installation like this:
$ npx create-strapi-app plesk-strapi --no-run
$ # custom Install
$ # wait for installation process to finish
$ cd strapi-demo.xyz.com
$ npm run build
I configured the application bootstrap at the project root, ‘/app.js’:
const strapi = require('@strapi/strapi'); // require('strapi');
strapi().start();
And the file ‘config/server.js’:
module.exports = ({ env }) => ({
    host: env('HOST', process.env.HOST || '127.0.0.1'),
    port: env.int('PORT', parseInt(process.env.PORT) || 80),
    app: {
        keys: env.array('APP_KEYS'),
    },
});
And the file ‘/.env’:
HOST=127.0.0.1
PORT=80
APP_KEYS= mySecretKey1, mySecretKey2, mySecretKey3, mySecretKey4
API_TOKEN_SALT= mySecretSaltKey
ADMIN_JWT_SECRET=mySecretKey
JWT_SECRET=mySecretKey
# `sqlite` or `mysql`
DATABASE_CONNECTION_NAME=mysql
DATABASE_HOST=localhost
DATABASE_PORT=3306
DATABASE_NAME=myDbName
DATABASE_USERNAME=MyDbUsername
DATABASE_PASSWORD=MyDbPassword
Plesk Node.js config: (screenshot not included)
The returned error:
[2022-10-20 09:33:55.897] debug: ⛔️ Server wasn't able to start properly.
[2022-10-20 09:33:55.904] error: options.port should be >= 0 and < 65536. Received NaN.
RangeError [ERR_SOCKET_BAD_PORT]: options.port should be >= 0 and < 65536. Received NaN.
    at new NodeError (node:internal/errors:372:5)
    at validatePort (node:internal/validators:217:11)
    at Server.listen (node:net:1500:5)
    at Object.listen (E:\plesk\vhost\xyz.com\strapi-demo.xyz.com\node_modules\@strapi\strapi\lib\services\server\index.js:122:25)
    at E:\plesk\vhost\xyz.com\strapi-demo.xyz.com\node_modules\@strapi\strapi\lib\Strapi.js:302:21
    at new Promise (<anonymous>)
    at Strapi.listen (E:\plesk\vhost\xyz.com\strapi-demo.xyz.com\node_modules\@strapi\strapi\lib\Strapi.js:281:12)
    at Strapi.start (E:\plesk\vhost\xyz.com\strapi-demo.xyz.com\node_modules\@strapi\strapi\lib\Strapi.js:215:18)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)

Possible reasons for groovy program running as kubernetes job dumping threads during execution

I have a simple groovy script that leverages the GPars library's withPool functionality to launch HTTP GET requests to two internal API endpoints in parallel.
The script runs fine locally, both directly as well as a docker container.
When I deploy it as a Kubernetes Job (in our internal EKS cluster: 1.20), it runs there as well, but the moment it hits the first withPool call, I see a giant thread dump, but the execution continues, and completes successfully.
NOTE: Containers in our cluster run with the following pod security context:
securityContext:
  fsGroup: 2000
  runAsNonRoot: true
  runAsUser: 1000
Environment
# From the k8s job container
groovy@app-271df1d7-15848624-mzhhj:/app$ groovy --version
WARNING: Using incubator modules: jdk.incubator.foreign, jdk.incubator.vector
Groovy Version: 4.0.0 JVM: 17.0.2 Vendor: Eclipse Adoptium OS: Linux
groovy@app-271df1d7-15848624-mzhhj:/app$ ps -ef
UID PID PPID C STIME TTY TIME CMD
groovy 1 0 0 21:04 ? 00:00:00 /bin/bash bin/run-script.sh
groovy 12 1 42 21:04 ? 00:00:17 /opt/java/openjdk/bin/java -Xms3g -Xmx3g --add-modules=ALL-SYSTEM -classpath /opt/groovy/lib/groovy-4.0.0.jar -Dscript.name=/usr/bin/groovy -Dprogram.name=groovy -Dgroovy.starter.conf=/opt/groovy/conf/groovy-starter.conf -Dgroovy.home=/opt/groovy -Dtools.jar=/opt/java/openjdk/lib/tools.jar org.codehaus.groovy.tools.GroovyStarter --main groovy.ui.GroovyMain --conf /opt/groovy/conf/groovy-starter.conf --classpath . /tmp/script.groovy
groovy 116 0 0 21:05 pts/0 00:00:00 bash
groovy 160 116 0 21:05 pts/0 00:00:00 ps -ef
Script (relevant parts)
@Grab('org.codehaus.gpars:gpars:1.2.1')
import static groovyx.gpars.GParsPool.withPool
import groovy.json.JsonSlurper

final def jsl = new JsonSlurper()
//...
while (!(nextBatch = getBatch(batchSize)).isEmpty()) {
    def devThread = Thread.start {
        withPool(poolSize) {
            nextBatch.eachParallel { kw ->
                String url = dev + "&" + "query=$kw"
                try {
                    def response = jsl.parseText(url.toURL().getText(connectTimeout: 10.seconds, readTimeout: 10.seconds,
                            useCaches: true, allowUserInteraction: false))
                    devResponses[kw] = response
                } catch (e) {
                    println("\tFailed to fetch: $url | error: $e")
                }
            }
        }
    }
    def stgThread = Thread.start {
        withPool(poolSize) {
            nextBatch.eachParallel { kw ->
                String url = stg + "&" + "query=$kw"
                try {
                    def response = jsl.parseText(url.toURL().getText(connectTimeout: 10.seconds, readTimeout: 10.seconds,
                            useCaches: true, allowUserInteraction: false))
                    stgResponses[kw] = response
                } catch (e) {
                    println("\tFailed to fetch: $url | error: $e")
                }
            }
        }
    }
    devThread.join()
    stgThread.join()
}
Dockerfile
FROM groovy:4.0.0-jdk17 as builder
USER root
RUN apt-get update && apt-get install -yq bash curl wget jq
WORKDIR /app
COPY bin /app/bin
RUN chmod +x /app/bin/*
USER groovy
ENTRYPOINT ["/bin/bash"]
CMD ["bin/run-script.sh"]
The bin/run-script.sh simply downloads the above groovy script at runtime and executes it.
wget "$GROOVY_SCRIPT" -O "$LOCAL_FILE"
...
groovy "$LOCAL_FILE"
As soon as the execution hits the first call to withPool(poolSize), there's a giant thread dump, but execution continues.
I'm trying to figure out what could be causing this behavior. Any ideas 🤷🏽‍♂️?
Thread dump
For posterity, answering my own question here.
The issue turned out to be this log4j2 JVM hot-patch that we're currently leveraging to fix the recent log4j2 vulnerability. This agent (running as a DaemonSet) patches all running JVMs in all our k8s clusters.
This, somehow, causes my OpenJDK 17 based app to thread dump. I found the same issue with an ElasticSearch 8.1.0 deployment as well (also uses a pre-packaged OpenJDK 17). This one is a service, so I could see a thread dump happening pretty much every half hour! Interestingly, there are other JVM services (and some SOLR 8 deployments) that don't have this issue 🤷🏽‍♂️.
Anyway, I worked with our devops team to temporarily exclude the node that deployment was running on, and lo and behold, the thread dumps disappeared!
Balance in the universe has been restored 🧘🏻‍♂️.

Nomad Job - Failed to place all allocations

I’m trying to deploy an AWS EBS volume via Nomad, but I’m getting the error below. How do I resolve it?
$ nomad job plan -var-file bambootest.vars bamboo2.nomad
+/- Job: "bamboo2"
+/- Stop: "true" => "false"
+/- Task Group: "main" (1 create)
      Volume {
        AccessMode:     "single-node-writer"
        AttachmentMode: "file-system"
        Name:           "bambootest"
        PerAlloc:       "false"
        ReadOnly:       "false"
        Source:         "bambootest"
        Type:           "csi"
      }
      Task: "web"
Scheduler dry-run:
WARNING: Failed to place all allocations.
  Task Group "main" (failed to place 1 allocation):
    Class "system": 3 nodes excluded by filter
    Class "svt": 2 nodes excluded by filter
    Class "devtools": 2 nodes excluded by filter
    Class "bambootest": 2 nodes excluded by filter
    Class "ambt": 2 nodes excluded by filter
    Constraint "${meta.namespace} = bambootest": 9 nodes excluded by filter
    Constraint "missing CSI Volume bambootest": 2 nodes excluded by filter
Below is an excerpt of the volume block that seems to be the problem.
group main {
  count = 1

  volume "bambootest" {
    type            = "csi"
    source          = "bambootest"
    read_only       = false
    access_mode     = "single-node-writer"
    attachment_mode = "file-system"
  }

  task web {
    driver = "docker"

Conda using latest packages of certain date

The title is self-explanatory. Is there a way to downgrade the conda packages to the ones that were the latest on a certain date?
This is not possible programmatically. Packages in Conda are specified through MatchSpec, which does not currently have any way to constrain on a build timestamp.
Manual Searching
When searching for packages via conda search, the --info flag will print the build timestamps if they are available. So, for example, if one wanted to find the latest version of PyMC3 that someone with Python 3.6 was running a year ago (9 Dec 2018), one could check
conda search --info 'conda-forge::pymc3'
and see that version 3.5, build py36_1000 would satisfy this. If one wanted to create an env with this build in it, they could use
conda create -n py36_pymc35 -c conda-forge pymc3=3.5=py36_1000
2023 Update
In addition to Merv's post, I may add that the --json flag makes it actually quite easy to programmatically gather the history. Once you have the history, you can search for the latest package versions as of some date, and make an environment with them (we do that routinely to establish "low watermark" environments for our CIs).
The conda command-line invocation is:
conda search -q {package} --info --json
Here is some code that uses that to gather the history of a few packages. It is also multi-threaded to speed things up a little.
import io
import json
import subprocess
import yaml

from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta

from tqdm import tqdm


def shell(cmd):
    proc = subprocess.run(cmd, shell=True, capture_output=True)
    return proc.stdout.decode('utf-8')


def version_as_tuple(v):
    return tuple(map(int, v.split('.')))


def get_history(p):
    txt = shell(f"conda search -q {p} --info --json")
    d = json.loads(txt)
    h = defaultdict(set)
    for vv in d.values():
        for x in vv:
            h[version_as_tuple(x['version'])].add(
                datetime.fromtimestamp(x.get('timestamp', 0) / 1e3)
            )
    h = {vers: min(dates) for vers, dates in h.items()}
    return p, h
Example usage:
metayaml = """
- boto3
- pandas >=0.25
- python >=3.8
"""
reqs = yaml.safe_load(metayaml)  # in real life, read from conda.recipe/meta.yaml

all_pkgs = sorted(set([p.split()[0] for p in reqs]))

with ThreadPoolExecutor() as pool:
    history = dict(tqdm(pool.map(get_history, all_pkgs), total=len(all_pkgs)))
After that, we have a neat version history for all dependent packages. For example:
>>> {v: f'{t:%Y-%m-%d}' for v, t in history['pandas'].items()}
{(0, 20, 3): '2017-09-18',
(0, 21, 0): '2017-11-06',
(0, 21, 1): '2017-12-12',
...
(1, 4, 4): '2022-09-21',
(1, 5, 1): '2022-11-16',
(1, 5, 2): '2022-12-07'}
And:
asof = datetime.now() - timedelta(weeks=2*52)

new = {
    name: max([(vers, t) for vers, t in v.items() if t < asof])
    for name, v in history.items()
}

print(f'# as of {asof:%Y-%m-%d}')
for name, (vers, t) in new.items():
    print(f' - {name} =={".".join(map(str, vers))}  # released on {t:%Y-%m-%d}')
Which produces:
# as of 2021-01-20
- boto3 ==1.16.55 # released on 2021-01-15
- pandas ==1.2.0 # released on 2020-12-26
- python ==3.9.1 # released on 2020-12-11
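As a small follow-up (my own sketch, not part of the original answer), the pinned list above can be turned directly into a conda create command, reusing the new and asof variables from the previous snippet:

# Build a `conda create` invocation from the pins computed above.
# Assumes `new` and `asof` are still in scope from the previous snippet.
specs = ' '.join(
    f'{name}={".".join(map(str, vers))}' for name, (vers, _t) in new.items()
)
print(f'conda create -n asof-{asof:%Y%m%d} {specs}')
# e.g. conda create -n asof-20210120 boto3=1.16.55 pandas=1.2.0 python=3.9.1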

MapR installation failing for single node cluster

I was following the quick installation guide for a single-node cluster. For this I used a 20 GB storage file for MapR-FS, but during installation it gives 'Unable to find disks: /maprfs/storagefile'.
Here is my configuration file.
# Each Node section can specify nodes in the following format
# Hostname: disk1, disk2, disk3
# Specifying disks is optional. If not provided, the installer will use the values of 'disks' from the Defaults section
[Control_Nodes]
maprlocal.td.td.com: /maprfs/storagefile
#control-node2.mydomain: /dev/disk3, /dev/disk9
#control-node3.mydomain: /dev/sdb, /dev/sdc, /dev/sdd
[Data_Nodes]
#data-node1.mydomain
#data-node2.mydomain: /dev/sdb, /dev/sdc, /dev/sdd
#data-node3.mydomain: /dev/sdd
#data-node4.mydomain: /dev/sdb, /dev/sdd
[Client_Nodes]
#client1.mydomain
#client2.mydomain
#client3.mydomain
[Options]
MapReduce1 = true
YARN = true
HBase = true
MapR-DB = true
ControlNodesAsDataNodes = true
WirelevelSecurity = false
LocalRepo = false
[Defaults]
ClusterName = my.cluster.com
User = mapr
Group = mapr
Password = mapr
UID = 2000
GID = 2000
Disks = /maprfs/storagefile
StripeWidth = 3
ForceFormat = false
CoreRepoURL = http://package.mapr.com/releases
EcoRepoURL = http://package.mapr.com/releases/ecosystem-4.x
Version = 4.0.2
MetricsDBHost =
MetricsDBUser =
MetricsDBPassword =
MetricsDBSchema =
Below is the error that i am getting.
2015-04-16 08:18:03,659 callbacks 42 [INFO]: Running task: [Verify Pre-Requisites]
2015-04-16 08:18:03,661 callbacks 87 [ERROR]: maprlocal.td.td.com: Unable to find disks: /maprfs/storagefile from /maprfs/storagefile remove disks: /dev/sda,/dev/sda1,/dev/sda2,/dev/sda3 and retry
2015-04-16 08:18:03,662 callbacks 91 [ERROR]: failed: [maprlocal.td.td.com] => {"failed": true}
2015-04-16 08:18:03,667 installrunner 199 [ERROR]: Host: maprlocal.td.td.com has 1 failures
2015-04-16 08:18:03,668 common 203 [ERROR]: Control Nodes have failures. Please fix the failures and re-run the installation. For more information refer to the installer log at /opt/mapr-installer/var/mapr-installer.log
Please help me here. Thanks, Shashi
The error is resolved by adding the --skip-checks option when running the installer:
/opt/mapr-installer/bin/install --skip-checks new
