I'm working on a tool for monitoring the jobs currently running on a cluster (19 nodes, 40 cores). Is there any way to determine which specific CPUs each job in the Slurm queue is using? I'm collecting data with 'pidstat', 'mpstat', and 'ps -eFj', which tells me what processes are running on a particular core, but I have no way to relate those process IDs to the job IDs that Slurm uses. 'scontrol show job' gives a lot of information, but not the specific CPU allocation.
Here's the code that collects the data:
#!/usr/bin/env python
import subprocess
import threading
import time


def scan():
    # One [mpstat, pidstat, ps] slot per node; node numbers are 1-based.
    data = [[None, None, None] for i in range(19)]

    def mpstat(node):
        # Node 1 is queried locally, all other nodes over ssh.
        if node == 1:
            output = subprocess.check_output(['mpstat', '-P', 'ALL', '1', '1'])
        else:
            output = subprocess.check_output(['ssh', 'node' + str(node), 'mpstat', '-P', 'ALL', '1', '1'])
        data[node - 1][0] = output

    def pidstat(node):
        if node == 1:
            output = subprocess.check_output(['pidstat', '1', '1'])
        else:
            output = subprocess.check_output(['ssh', 'node' + str(node), 'pidstat', '1', '1'])
        data[node - 1][1] = output

    def ps(node):
        if node == 1:
            output = subprocess.check_output(['ps', '-eFj'])
        else:
            output = subprocess.check_output(['ssh', 'node' + str(node), 'ps', '-eFj'])
        data[node - 1][2] = output

    threads = [[None, None, None] for i in range(19)]
    for node in range(1, 19 + 1):
        threads[node - 1][0] = threading.Thread(target=mpstat, args=(node,))
        threads[node - 1][0].start()
        threads[node - 1][1] = threading.Thread(target=pidstat, args=(node,))
        threads[node - 1][1].start()
        threads[node - 1][2] = threading.Thread(target=ps, args=(node,))
        threads[node - 1][2].start()

    # Poll until every thread has finished; each flag is True once its thread is done.
    while True:
        alive = [[not t.is_alive() for t in n] for n in threads]
        alive = [t for n in alive for t in n]
        if all(alive):
            break
        time.sleep(0.1)  # polling interval is an arbitrary choice
    return data

By using the -d flag you can get the CPU_IDs of the job on each node as shown below.
$ scontrol show job -d $SLURM_JOBID
JobId=1 JobName=bash
UserId=USER(UID) GroupId=GROUP(GID) MCS_label=N/A
Priority=56117 Nice=0 Account=account QOS=interactive
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:00:10 TimeLimit=02:00:00 TimeMin=N/A
SubmitTime=2019-04-12T17:34:11 EligibleTime=2019-04-12T17:34:11
StartTime=2019-04-12T17:34:12 EndTime=2019-04-12T19:34:12 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=defq AllocNode:Sid=node2:25638
ReqNodeList=(null) ExcNodeList=(null)
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=2 ReqB:S:C:T=0:0:*:*
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
Nodes=node1 CPU_IDs=12-13 Mem=17600 GRES_IDX=
MinCPUsNode=2 MinMemoryCPU=8800M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=(null) Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
If this information is not enough, the output of scontrol pidinfo PID may also be useful, since it reports the Slurm job ID that a given process belongs to:
$ scontrol pidinfo 43734
Slurm job id 21757758 ends at Fri Apr 12 20:15:49 2019
slurm_get_rem_time is 6647
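For a monitoring script, the two outputs above could be parsed roughly as follows. This is a minimal sketch, not part of Slurm: the function names are hypothetical, and it assumes that CPU_IDs is a comma-separated list of single ids and ranges (as in 12-13) and that the first line of the pidinfo output has the form shown above. Check both against the output of your Slurm version.

```python
import re


def expand_cpu_ids(spec):
    """Expand a CPU_IDs value such as '12-13' or '0-3,8' into a list of ints."""
    cpus = []
    for part in spec.split(','):
        if not part:
            continue
        if '-' in part:
            first, last = part.split('-')
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


# Matches the per-node line of `scontrol show job -d`, e.g.
# "Nodes=node1 CPU_IDs=12-13 Mem=17600 GRES_IDX="
NODE_LINE = re.compile(r'^\s*Nodes=(\S+)\s+CPU_IDs=(\S+)', re.MULTILINE)

# Matches the first line of `scontrol pidinfo PID`.
PIDINFO_LINE = re.compile(r'Slurm job id (\d+)')


def parse_cpu_allocation(scontrol_text):
    """Return {node expression: [cpu ids]} for one job's detailed output."""
    return {nodes: expand_cpu_ids(ids)
            for nodes, ids in NODE_LINE.findall(scontrol_text)}


def parse_pidinfo(pidinfo_text):
    """Return the Slurm job id for a process, or None if no id is found."""
    match = PIDINFO_LINE.search(pidinfo_text)
    return int(match.group(1)) if match else None
```

The text to parse would come from something like subprocess.check_output(['scontrol', 'show', 'job', '-d', job_id]).decode(). Note that the dictionary key is the raw Nodes value, so a multi-node expression is kept as written rather than expanded into individual host names.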


Improve code result speed by multiprocessing

I'm teaching myself Python and this is my first program.
I analyze server logs, and usually I need to analyze a full day of logs. I wrote a script (a simplified example with trivial logic) just to measure speed. With plain sequential code, analyzing 20 million rows takes about 12-13 minutes. I need to process 200 million rows in 5 minutes.
What I tried:
Multiprocessing (I ran into a problem with shared memory, which I think I fixed). The result: 300K rows take 20 seconds, no matter how many processes I use. (PS: I also need to be able to set the number of processes in advance.)
Threading (it gives no speedup: 300K rows take 2 seconds, the same as the plain code).
asyncio (I thought the script was slow because it has to read many files). Same result as threading: 300K rows in 2 seconds.
In the end I suspect that all three of my scripts are wrong and do not work as intended.
PS: I try to avoid third-party modules (like pandas) because they would make the script harder to run on different servers. I'd rather stick to the standard library.
Please help me check the first one, the multiprocessing version.
import csv
import os
from multiprocessing import Process, Queue, Value, Manager

file = {"hcs.log", "hcs1.log", "hcs2.log", "hcs3.log"}


def argument(m, a, n):
    proc_num = os.getpid()
    a_temp_m = a["vod_miss"]
    a_temp_h = a["vod_hit"]
    with open(os.getcwd() + '/' + m, newline='') as hcs_1:
        hcs_2 = csv.reader(hcs_1, delimiter=' ')
        for j in hcs_2:
            if j[3].find('MISS') != -1:
                a_temp_m[n] = a_temp_m[n] + 1
            elif j[3].find('HIT') != -1:
                a_temp_h[n] = a_temp_h[n] + 1
    a["vod_miss"][n] = a_temp_m[n]
    a["vod_hit"][n] = a_temp_h[n]


if __name__ == '__main__':
    procs = []
    manager = Manager()
    vod_live_cuts = manager.dict()
    i = "vod_hit"
    ii = "vod_miss"
    cpu = 1
    n = 1
    vod_live_cuts[i] = manager.list([0] * cpu)
    vod_live_cuts[ii] = manager.list([0] * cpu)
    for m in file:
        proc = Process(target=argument, args=(m, vod_live_cuts, (n-1)))
        procs.append(proc)
        proc.start()
        if n >= cpu:
            n = 1
        else:
            n += 1
    [proc.join() for proc in procs]
    [proc.close() for proc in procs]
I expect each file to be processed by the function argument in an independent process, with all results finally saved in the dict vod_live_cuts. I added a separate list slot in the dict for each process, hoping this would keep the processes from interfering with each other when they update it. But maybe this is the wrong approach :(
Using IPC is costly, so only use "shared objects" for saving the final result, not for intermediate results while parsing the file.
Limiting the number of processes is done with a multiprocessing.Pool. The following code uses one to reach the maximum hard-disk speed; you only need to post-process the results.
You can only parse data as fast as your HDD can read it (typically 30-80 MB/s), so to improve performance further you should use an SSD or RAID 0 for higher disk throughput. You cannot get much faster than this without changing your hardware.
import csv
import os
from multiprocessing import Process, Queue, Value, Manager, Pool

file = {"hcs.log", "hcs1.log", "hcs2.log", "hcs3.log"}


def argument(m, a):
    proc_num = os.getpid()
    a_temp_m_n = 0  # make it local to process
    a_temp_h_n = 0  # as shared lists use IPC
    with open(os.getcwd() + '/' + m, newline='') as hcs_1:
        hcs_2 = csv.reader(hcs_1, delimiter=' ')
        for j in hcs_2:
            if j[3].find('MISS') != -1:
                a_temp_m_n = a_temp_m_n + 1
            elif j[3].find('HIT') != -1:
                a_temp_h_n = a_temp_h_n + 1
    # hand the per-file counts back to the parent process
    return a_temp_m_n, a_temp_h_n


if __name__ == '__main__':
    manager = Manager()
    vod_live_cuts = manager.dict()
    i = "vod_hit"
    ii = "vod_miss"
    cpu = 1
    vod_live_cuts[i] = manager.list()
    vod_live_cuts[ii] = manager.list()
    with Pool(cpu) as pool:
        tasks = []
        for m in file:
            task = pool.apply_async(argument, args=(m, vod_live_cuts))
            tasks.append(task)
        for task in tasks:
            # wait for each worker, then store its counts as the final result
            misses, hits = task.get()
            vod_live_cuts[ii].append(misses)
            vod_live_cuts[i].append(hits)
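The same idea can be written without any shared objects at all, since each worker only needs to return two integers. The following is a minimal sketch under some assumptions: the function names are hypothetical, lines are split on single spaces instead of going through csv.reader (so quoted fields are not handled), and the HIT/MISS marker is assumed to sit in the fourth field as in the question.

```python
from multiprocessing import Pool


def count_file(path):
    """Return (hits, misses) for one log file, counted locally in the worker."""
    hits = misses = 0
    with open(path) as log:
        for line in log:
            fields = line.split(' ')
            if len(fields) < 4:
                continue  # skip blank or short lines
            if 'MISS' in fields[3]:
                misses += 1
            elif 'HIT' in fields[3]:
                hits += 1
    return hits, misses


def count_all(paths, processes=None):
    """Run count_file over all paths in a pool and sum the per-file counts.

    processes=None lets Pool use one worker per available CPU.
    """
    with Pool(processes) as pool:
        per_file = pool.map(count_file, paths)
    total_hits = sum(h for h, _ in per_file)
    total_misses = sum(m for _, m in per_file)
    return total_hits, total_misses


# Example call (file names taken from the question):
#     count_all(["hcs.log", "hcs1.log", "hcs2.log", "hcs3.log"], processes=4)
```

Only the small result tuples cross the process boundary here, once per file. On platforms that start workers with "spawn", the call to count_all should be placed under an if __name__ == '__main__': guard.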

Fitting Lightgbm distributed with lgb.train hangs

I'm trying to learn how to use lightgbm distributed.
I wrote a simple hello-world kind of program that uses the iris dataset with 150 rows, split into train (100 rows) and test (50 rows). The train and test sets are then each split further into two parts, and each part is fed to one of two machines according to its rank.
The problem I see is that lgb.train hangs.
Here is the code:
import argparse
import logging
import lightgbm as lgb
import pandas as pd
from sklearn import datasets
import socket
print('lightgbm', lgb.__version__)
HOST = socket.gethostname()
ip_address = socket.gethostbyname(HOST)
print("IP=", ip_address)
# looks like lightgbm operates only with ip addresses
IPS = ['', '']
assert ip_address in IPS
logger = logging.getLogger(__name__)
pd.set_option('display.max_rows', 4)
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 10000)
pd.set_option('max_colwidth', 100)
pd.set_option('precision', 5)
def read_train_data(rank):
iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
partition = rank
assert partition < 2
separate = 100
train_df = iris_df.iloc[:separate]
test_df = iris_df.iloc[separate:]
separate_train = 60
separate_test = 30
if partition == 0:
train_df = train_df.iloc[:separate_train]
test_df = test_df.iloc[:separate_test]
train_df = train_df.iloc[separate_train:]
test_df = test_df.iloc[separate_test:]
def get_lgb_dataset(df):
target_column = df.columns[-1]
columns = df.columns[:-1]
assert target_column not in columns
print('Target column', target_column)
x = df[columns]
y = df[target_column]
ds = lgb.Dataset(free_raw_data=False, data=x, label=y, params={
"enable_bundle": False
})
return ds
dtrain = get_lgb_dataset(train_df)
dtest = get_lgb_dataset(test_df)
return dtrain, dtest
def train(args):
port0 = 56456
rank = IPS.index(ip_address)
print("Rank=", rank, HOST)
print("RR", rank)
dtrain, dtest = read_train_data(rank=rank)
params = {'boosting_type': 'gbdt',
'class_weight': None,
'colsample_bytree': 1.0,
'importance_type': 'split',
'learning_rate': 0.1,
'max_depth': 2,
'min_child_samples': 20,
'min_child_weight': 0.001,
'min_split_gain': 0.0,
'n_estimators': 1,
'num_leaves': 31,
'objective': 'regression',
'metric': 'rmse',
'random_state': None,
'reg_alpha': 0.0,
'reg_lambda': 0.0,
'silent': False,
'subsample': 1.0,
'subsample_for_bin': 200000,
'subsample_freq': 0,
'tree_learner': 'data_parallel',
'num_threads': 48,
'machines': ','.join([f'{machine}:{port0}' for i, machine in enumerate(IPS)]),
'local_listen_port': port0,
'time_out': 120,
'num_machines': len(IPS)
}
print(params)
logger.info("starting to train lgb at node with rank %d", rank)
evals_result = {}
if args.scikit == 1:
print("Using scikit learn")
bst = lgb.sklearn.LGBMRegressor(**params),
eval_set=[(, dtest.label)],
print("Using regular LGB")
bst = lgb.train(params,
print(evals_result)
logger.info("finish xgboost training at node with rank %d", rank)
return bst
def main(args):
logger.info("starting the train job")
model = train(args)
pd.set_option('display.max_rows', 500)
print("OUT", model.__class__)
if __name__ == '__main__':
parser = argparse.ArgumentParser()
I can run it with the scikit fit interface by running: python --scikit 1
On the two machines. It produces a reasonable result.
However, when I use --scikit 0 (which uses lgb.train), fitting just hangs on both nodes. The last messages before it hangs:
[LightGBM] [Info] Total Bins 22
[LightGBM] [Info] Number of data points in the train set: 40, number of used features: 2
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
[LightGBM] [Info] Start training from score 0.873750
Is that a bug or an expected behavior? in lightgbm does use scikit learn fit interface.
I use an overnight master version 5b7a6f3e7150aeb704d1dd2b852d246af3e913a3 tag to be exact from Jul 12.
I'm trying to dig into the code. So far I see a few things:
The scikit-learn interface appears to have an extra synchronization step before fitting the first tree. lgb.train doesn't have it. I don't know yet where it comes from. (I see some Network::Allreduce operations.)
With the scikit-learn interface the workers appear to be synchronized: each worker knows the correct sizes of the blocks to send and receive during reduce-scatter operations. For example, on the first allreduce worker1 sends 208 blocks and receives 368 blocks of data (in Linkers::SendRecv), while worker2 is reversed: it sends 368 and receives 208. So the allreduce completes fine.
By contrast, with lgb.train the workers are not synchronized: each worker has send and receive block counts for the reduce-scatter at the first DataParallelTreeLearner::FindBestSplits call, but they don't match. Worker1 sends 208 and wants to receive 400. Worker2 sends 192 and wants to receive 176. So the worker that wants to receive more just hangs, and the other worker eventually hangs too.
Possibly it has something to do with lgb.Dataset, which may need to have the same bins on every worker or something similar. I tried to force this with the forcedbins_filename parameter, but it doesn't seem to help with lgb.train.
Success. If I remove the following line from the example:
Everything works. So I guess we can't use construct on Dataset when using distributed training.

Why doesn't the performance of a PySpark script improve with more cores and executors?

I have a script that performs binary classification by loading a pre-trained model. I wonder why I always get approximately the same performance when I try different combinations of num-executors and executor-cores. Here are the important lines in my PySpark script:
start = time.time()
# extract evaluation pairs
aug_comb_mldf ='eventId').crossJoin('eventId').withColumnRenamed('eventId', 'eventIde'))
pt1 = time.time()
# feature engineering
feats_titles = ["feat1f", "feat2f", "feat3f",
"feat4f", "feat5f", "feat6f"]
augfldf = aug_comb_mldf.join(dfml_partial.withColumnRenamed('eventId', 'eventId').alias('a'), ['eventId'], 'inner') \
.join(dfallaug.withColumnRenamed('eventId', 'eventIde').drop('id').alias('b'), ['eventIde'], 'inner')\
.withColumn('feat1f', when(expr('a.feat1 = b.feat1'), 1).otherwise(0))\
.withColumn('feat2f', when(expr('a.feat2 = b.feat2'), 1).otherwise(0))\
.withColumn('feat3f', when(expr('a.feat3 = b.feat3'), 1).otherwise(0))\
.withColumn('feat4f', when(expr('a.feat4 = b.feat4'), 1).otherwise(0))\
.withColumn('feat5f', when(expr('a.feat5 = b.feat5'), 1).otherwise(0))\
.withColumn('feat6f', when(expr('a.feat6 = b.feat6'), 1).otherwise(0))\
pt2 = time.time()
# Make predictions.
aug_predictions = model.transform(augfldf)
pt3 = time.time()
aug_predictions_true ="eventId", "eventIde", "id", "probability")
aug_predictions_true = aug_predictions_true.filter((aug_predictions.predictedLabel != '0'))
# find highest prob
aug_predictions_true = aug_predictions_true.withColumn("rank", row_number().over(w.orderBy(desc("probability"))))\
pt4 = time.time()
print ("pt1-start = ", pt1-start)
print ("pt2-start = ", pt2-pt1)
print ("pt3-start = ", pt3-pt2)
print ("pt4-start = ", pt4-pt3)
print ("total = ", pt4-start)
Here is the performance:
('pt1-start = ', 0.034136056900024414)
('pt2-start = ', 0.41227102279663086)
('pt3-start = ', 0.12337303161621094)
('pt4-start = ', 0.1068110466003418)
('total = ', 0.676591157913208)
Here is how I run this script:
spark-submit --master yarn --num-executors 16 --executor-cores 4 --executor-memory 12g --driver-memory 6g
I ran spark-submit with different combinations of the four settings shown above, and I always get approximately the same performance.
--executor-cores sets the number of tasks each executor can run in parallel.
However, because of Python's GIL (Global Interpreter Lock), threads inside a single Python interpreter do not run Python code in parallel, so adding cores may not speed up the Python side of the job.
So to improve your runtime, I would suggest experimenting with a larger number of executors instead of increasing the number of cores per executor.

Scheduling periodic requests to multiple devices using a shared channel

I need to request data periodically from a configurable number of devices at configurable intervals (per device). All devices are connected to a shared data bus, so only one device can send data at the same time.
The devices have very little memory, so each device can only keep the data for a certain period of time before it is overwritten by the next chunk. This means I need to make sure to request data from any given device while it is still available, or else it will be lost.
I am looking for an algorithm that, given a list of devices and their respective timing properties, finds a feasible schedule in order to achieve minimal data loss.
I guess each device could be formally described using the following properties:
data_interval: time it takes for the next chunk of data to become available
max_request_interval: maximum amount of time between requests that will not cause data loss
processing_time: time it takes to send a request and fully receive the corresponding response containing the requested data
Basically, I need to make sure to request data from every device once its data is ready and not yet expired, while keeping in mind the deadlines for all other devices.
Is there some sort of algorithm for this kind of problem? I highly doubt I'm the first person to ever encounter a situation like this. Searching for existing solutions online didn't yield many useful results, mainly because scheduling algorithms are mostly used for operating systems and such, where scheduled processes can be paused and resumed at will. I can't do this in my case, however, since the process of requesting and receiving a chunk of data is atomic, i.e. it can only be performed in its entirety or not at all.
I solved this problem using non-preemptive deadline monotonic scheduling.
Here's some python code for anyone interested:
"""This module implements non-preemptive deadline monotonic scheduling (NPDMS) to compute a schedule of periodic,
non-preemptable requests to slave devices connected to a shared data bus"""
from math import gcd
from functools import reduce
from typing import List


class Slave:
    def __init__(self, name: str, period: int, processing_time: int, offset=0, deadline=None):
        self.name = name
        self.period = int(period)
        self.processing_time = int(processing_time)
        self.offset = int(offset)
        if self.offset >= self.period:
            raise ValueError("Slave %s: offset must be < period" % name)
        # the deadline defaults to the full period
        self.deadline = int(deadline) if deadline else self.period
        if self.deadline > self.period:
            raise ValueError("Slave %s: deadline must be <= period" % name)
class Request:
    def __init__(self, slave: Slave, start_time: int):
        self.slave = slave
        self.start_time = start_time
        self.end_time = start_time + slave.processing_time
        self.duration = self.end_time - self.start_time

    def overlaps_with(self, other: 'Request'):
        # Two requests overlap if the span covering both is shorter
        # than the sum of their durations.
        min_duration = self.duration + other.duration
        start = min(other.start_time, self.start_time)
        end = max(other.end_time, self.end_time)
        effective_duration = end - start
        return effective_duration < min_duration
class Scenario:
    def __init__(self, *slaves: Slave):
        self.slaves = list(slaves)
        # shortest deadline first (deadline monotonic priority)
        self.slaves.sort(key=lambda slave: slave.deadline)
        # LCM of all slave periods
        self.cycle_period = reduce(lambda a, b: a * b // gcd(a, b), [slave.period for slave in slaves])

    def compute_schedule(self, resolution=1) -> 'Schedule':
        request_pool = []
        for t in range(0, self.cycle_period, resolution):
            for slave in self.slaves:
                if (t - slave.offset) % slave.period == 0 and t >= slave.offset:
                    request_pool.append(Request(slave, t))
        # pop() takes from the end, so reverse to serve the earliest request first
        request_pool.reverse()

        scheduled_requests = []
        current_request = request_pool.pop()
        t = current_request.start_time
        while t < self.cycle_period:
            ongoing_request = Request(current_request.slave, t)
            # let the request run to completion (it cannot be preempted)
            while ongoing_request.start_time <= t < ongoing_request.end_time:
                t += resolution
            scheduled_requests.append(ongoing_request)
            if len(request_pool):
                current_request = request_pool.pop()
                t = max(current_request.start_time, t)
            else:
                current_request = None
                break
        if current_request:
            # ran out of time before this request could be dispatched
            request_pool.append(current_request)
        return Schedule(self, scheduled_requests, request_pool)
class Schedule:
    def __init__(self, scenario: Scenario, requests: List[Request], unscheduled: List[Request] = None):
        self.scenario = scenario
        self.requests = requests
        self.unscheduled_requests = unscheduled if unscheduled else []

        # bus utilization: sum of processing_time / period over all slaves
        self._utilization = 0
        for slave in self.scenario.slaves:
            self._utilization += float(slave.processing_time) / float(slave.period)

        self._missed_deadlines_dict = {}
        for slave in self.scenario.slaves:
            periods = scenario.cycle_period // slave.period
            missed_deadlines = []
            for period in range(periods):
                start = period * slave.period
                end = start + slave.period
                request = self._find_request(slave, start, end)
                if request:
                    if request.start_time < (start + slave.offset) or request.end_time > start + slave.deadline:
                        missed_deadlines.append(request)
            if missed_deadlines:
                self._missed_deadlines_dict[slave] = missed_deadlines

        self._overlapping_requests = []
        for i in range(0, len(requests)):
            if i == 0:
                continue
            previous_request = requests[i - 1]
            current_request = requests[i]
            if current_request.overlaps_with(previous_request):
                self._overlapping_requests.append((current_request, previous_request))

        self._incomplete_requests = []
        for request in self.requests:
            if request.duration < request.slave.processing_time:
                self._incomplete_requests.append(request)

    @property
    def is_feasible(self) -> bool:
        return self.utilization <= 1 \
            and not self.has_missed_deadlines \
            and not self.has_overlapping_requests \
            and not self.has_unscheduled_requests \
            and not self.has_incomplete_requests

    @property
    def utilization(self) -> float:
        return self._utilization

    @property
    def has_missed_deadlines(self) -> bool:
        return len(self._missed_deadlines_dict) > 0

    @property
    def has_overlapping_requests(self) -> bool:
        return len(self._overlapping_requests) > 0

    @property
    def has_unscheduled_requests(self) -> bool:
        return len(self.unscheduled_requests) > 0

    @property
    def has_incomplete_requests(self) -> bool:
        return len(self._incomplete_requests) > 0

    def _find_request(self, slave, start, end) -> [Request, None]:
        for r in self.requests:
            if r.slave == slave and r.start_time >= start and r.end_time < end:
                return r
        return None
def read_scenario(file) -> Scenario:
    from csv import DictReader
    return Scenario(*[Slave(**row) for row in DictReader(file)])


def write_schedule(schedule: Schedule, file):
    from csv import DictWriter
    writer = DictWriter(file, fieldnames=["name", "start", "end"])
    for request in schedule.requests:
        writer.writerow({"name": request.slave.name, "start": request.start_time, "end": request.end_time})
if __name__ == '__main__':
    import argparse
    import sys

    parser = argparse.ArgumentParser(formatter_class=argparse.RawTextHelpFormatter,
                                     description='Use non-preemptive deadline monotonic scheduling (NPDMS) to\n'
                                                 'compute a schedule of periodic, non-preemptable requests to\n'
                                                 'slave devices connected to a shared data bus.\n\n'
                                                 'Prints the computed schedule to stdout as CSV. Returns with\n'
                                                 'exit code 0 if the schedule is feasible, else 1.')
    parser.add_argument("csv_file", metavar="SCENARIO", type=str,
                        help="A csv file describing the scenario, i.e. a list\n"
                             "of slave devices with the following properties:\n"
                             "* name: name/id of the slave device\n\n"
                             "* period: duration of the period of time during\n"
                             "  which requests must be dispatched\n\n"
                             "* processing_time: amount of time it takes to\n"
                             "  fully process a request (worst-case)\n\n"
                             "* offset: offset for initial phase-shifting\n"
                             "  (default: 0)\n\n"
                             "* deadline: amount of time during which data is\n"
                             "  available after the start of each period\n"
                             "  (default: <period>)")
    parser.add_argument("-r", "--resolution", type=int, default=1,
                        help="The resolution used to simulate the passage of time (default: 1)")
    args = parser.parse_args()

    with open(args.csv_file, 'r') as f:
        schedule = read_scenario(f).compute_schedule(args.resolution)
    write_schedule(schedule, sys.stdout)
    sys.exit(0 if schedule.is_feasible else 1)
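Two quantities from the script above can be checked by hand before running the full scheduler: the cycle length (the LCM of all periods, after which the schedule repeats) and the bus utilization (the sum of processing_time / period). The sketch below computes both for a made-up set of devices. The function names and the example values are hypothetical and only serve as an illustration.

```python
from functools import reduce
from math import gcd


def hyperperiod(periods):
    """Least common multiple of all periods; the schedule repeats after this long."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods)


def utilization(devices):
    """Fraction of bus time needed by (period, processing_time) pairs."""
    return sum(processing_time / period for period, processing_time in devices)


# Hypothetical devices as (period, processing_time), all in the same time unit.
devices = [(10, 2), (20, 5), (40, 8)]

print(hyperperiod([period for period, _ in devices]))  # 40
print(utilization(devices))  # 2/10 + 5/20 + 8/40
```

A utilization above 1 means no schedule can exist. A value of 1 or less is necessary but not sufficient for non-preemptive requests, which is why the script also checks for missed deadlines, overlaps, and unscheduled requests.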

Multiprocessing on Python 3 Jupyter

I'm here because I have a problem with my Jupyter Python 3 notebook.
I need to create a function that uses the multiprocessing library.
Before implementing it, I ran some tests.
I found a lot of different examples, but the problem is always the same: my code is executed, but nothing is shown in the notebook's interface.
The code I am trying to run in Jupyter is this:
import os
from multiprocessing import Process, current_process


def doubler(number):
    """A doubling function that can be used by a process"""
    result = number * 2
    proc_name = current_process().name
    print('{0} doubled to {1} by: {2}'.format(
        number, result, proc_name))
    return result


if __name__ == '__main__':
    numbers = [5, 10, 15, 20, 25]
    procs = []
    proc = Process(target=doubler, args=(5,))
    for index, number in enumerate(numbers):
        proc = Process(target=doubler, args=(number,))
        proc2 = Process(target=doubler, args=(number,))
        procs.append(proc)
        procs.append(proc2)
        proc.start()
        proc2.start()
    proc = Process(target=doubler, name='Test', args=(2,))
    proc.start()
    procs.append(proc)
    for proc in procs:
        proc.join()
It works when I run my code outside Jupyter with the "python" command, and there I can see the printed output.
Is there a way, for my example and in Jupyter, to capture the results of my two tasks (proc1 and proc2, which both call the function "doubler") in a variable/object that I could use afterwards?
If "yes", how can I do it?
@Konate's answer really helped me. Here is a simplified version using multiprocessing.Pool:
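import multiprocessing

PROCESSES = 4  # pool size; not defined in the original snippet, so this value is an assumption


def double(a):
    return a * 2


def driver_func():
    with multiprocessing.Pool(PROCESSES) as pool:
        params = [(1, ), (2, ), (3, ), (4, )]
        results = [pool.apply_async(double, p) for p in params]
        for r in results:
            # get() blocks until the worker has returned its value
            print('\t', r.get())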
I succeeded by using multiprocessing.Pool.
I was inspired by this approach:
def test():
    print('Creating pool with %d processes\n' % PROCESSES)
    with multiprocessing.Pool(PROCESSES) as pool:
        TASKS = [(mul, (i, 7)) for i in range(10)] + \
                [(plus, (i, 8)) for i in range(10)]
        results = [pool.apply_async(calculate, t) for t in TASKS]
        imap_it = pool.imap(calculatestar, TASKS)
        imap_unordered_it = pool.imap_unordered(calculatestar, TASKS)
        print('Ordered results using pool.apply_async():')
        for r in results:
            print('\t', r.get())
        print('Ordered results using pool.imap():')
        for x in imap_it:
            print('\t', x)
For more, the code is at :
Another way of running multiprocessing jobs in a Jupyter notebook is to use one of the approaches supported by the nbmultitask package.
This works for me on macOS (I cannot make it work on Windows):
import multiprocessing as mp

mp_start_count = 0

if __name__ == '__main__':
    if mp_start_count == 0:
        mp_start_count += 1
