Job scheduling with minimization by parallel grouping - algorithm

I have a job scheduling problem with a twist: a minimization constraint. The task: I have many jobs, each with various dependencies on other jobs, without cycles. These jobs also have categories, and jobs of the same category can be run together in parallel for free. So I want to order the jobs so that each job comes after its dependencies, but arranged in such a way that they are grouped by category (to run many in parallel) to minimize the number of serial jobs I run. That is, adjacent jobs of the same category count as a single serial job.
I know I can sort topologically to handle dependency ordering. I’ve tried using graph coloring on the subgraphs containing each category of jobs, but I run into problems with inter-category dependency conflicts, specifically when I have to decide which of two or more pairs of jobs to group. I can brute-force this, and I can try random walks over the search space, but I’m hoping for something smarter: the former blows up exponentially in the worst case, and the latter is not guaranteed to be optimal.
To put things into scale: there can be as many as a couple hundred thousand jobs to schedule at once, with maybe a couple hundred categories of jobs.
I’ve stumbled upon several optimizations, such as building the dependency graph, splitting it into connected components, solving each subproblem independently, and merging the results. I also realize there’s a lower bound given by the number of colors needed to color each category’s subgraph, but I’m not sure how to use that beyond an early-exit condition.
Is there a better way to find an ordering of jobs that maximizes this “grouping” of jobs of a category, in order to minimize the total number of serial jobs?
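For concreteness, here is a minimal sketch of the greedy baseline I can already do: Kahn's topological sort that keeps draining jobs of the current category before switching. The category-switch decision is exactly the hard part described above, so this is only a cheap upper bound to beat, not an optimal method:
from collections import defaultdict, deque

def greedy_schedule(jobs, deps, category):
    # jobs:     list of job ids
    # deps:     dict job -> list of jobs it depends on
    # category: dict job -> category id
    indeg = {j: 0 for j in jobs}
    children = defaultdict(list)
    for j in jobs:
        for d in deps.get(j, ()):
            children[d].append(j)
            indeg[j] += 1

    # jobs whose dependencies are all done, bucketed by category
    ready = defaultdict(deque)
    for j in jobs:
        if indeg[j] == 0:
            ready[category[j]].append(j)

    order, serial_steps, current = [], 0, None
    while any(ready.values()):
        if not ready[current]:
            # forced to switch category: this begins a new serial step;
            # the tie-break (largest ready bucket) is only a heuristic
            current = max((c for c in ready if ready[c]),
                          key=lambda c: len(ready[c]))
            serial_steps += 1
        j = ready[current].popleft()
        order.append(j)
        for ch in children[j]:
            indeg[ch] -= 1
            if indeg[ch] == 0:
                ready[category[ch]].append(ch)
    return order, serial_steps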

Not sure if this is helpful, but instead of aiming for an algorithm, it is also possible to develop an optimization model and let a solver do the work.
A Mixed Integer Programming model can look like the sketch below.
The idea is that we minimize the total makespan, or the finish time of the latest job. This will automatically try to group together jobs of the same category (to allow parallel processing).
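The formulation itself was shown as an image in the original post; as a reconstruction sketch (my notation, not necessarily the exact model), a disjunctive big-M formulation in this spirit would be:
\begin{align*}
\min\ & T_{\max} \\
\text{s.t. } & \mathit{end}_j = \mathit{start}_j + \mathit{length}_j && \forall j \\
& \mathit{end}_j \le T_{\max} && \forall j \\
& \mathit{end}_i \le \mathit{start}_j && \forall (i,j) \in \mathit{before} \\
& \mathit{end}_j \le \mathit{due}_j && \forall j \text{ with a due date} \\
& \mathit{end}_i \le \mathit{start}_j + M \, x_{ij} && \forall i<j,\ \mathit{cat}(i) \ne \mathit{cat}(j) \\
& \mathit{end}_j \le \mathit{start}_i + M \, (1 - x_{ij}) && \forall i<j,\ \mathit{cat}(i) \ne \mathit{cat}(j) \\
& x_{ij} \in \{0,1\}
\end{align*}
Same-category pairs get no disjunction, so the solver may overlap them freely, and minimizing the makespan pushes it to do exactly that.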
I created some random data for 50 jobs and 5 categories. The data set includes some due dates and some precedence constraints.
---- 28 SET j jobs
job1 , job2 , job3 , job4 , job5 , job6 , job7 , job8 , job9 , job10, job11, job12
job13, job14, job15, job16, job17, job18, job19, job20, job21, job22, job23, job24
job25, job26, job27, job28, job29, job30, job31, job32, job33, job34, job35, job36
job37, job38, job39, job40, job41, job42, job43, job44, job45, job46, job47, job48
job49, job50
---- 28 SET c category
cat1, cat2, cat3, cat4, cat5
---- 28 SET jc job-category mapping
             cat1   cat2   cat3   cat4   cat5
job1          YES
job2                                      YES
job3                        YES
job4                 YES
job5                 YES
job6                 YES
job7                 YES
job8                                      YES
job9          YES
job10                       YES
job11                                     YES
job12                       YES
job13                                     YES
job14                              YES
job15         YES
job16                              YES
job17         YES
job18                YES
job19                              YES
job20                       YES
job21                YES
job22                YES
job23         YES
job24         YES
job25                       YES
job26                                     YES
job27                YES
job28                              YES
job29                              YES
job30                YES
job31         YES
job32                       YES
job33         YES
job34                                     YES
job35                YES
job36                YES
job37                       YES
job38                              YES
job39                              YES
job40                       YES
job41                       YES
job42         YES
job43                YES
job44         YES
job45                YES
job46         YES
job47                              YES
job48                       YES
job49                              YES
job50                YES
---- 28 PARAMETER length job duration
job1 11.611, job2 12.558, job3 11.274, job4 7.839, job5 5.864, job6 6.025, job7 11.413
job8 10.453, job9 5.315, job10 12.924, job11 5.728, job12 6.757, job13 10.256, job14 12.502
job15 6.781, job16 5.341, job17 10.851, job18 11.212, job19 8.894, job20 8.587, job21 7.430
job22 7.464, job23 6.305, job24 14.334, job25 8.799, job26 12.834, job27 8.000, job28 6.255
job29 12.489, job30 5.692, job31 7.020, job32 5.051, job33 7.696, job34 9.999, job35 6.513
job36 6.742, job37 8.306, job38 8.169, job39 8.221, job40 14.640, job41 14.936, job42 8.699
job43 8.729, job44 12.720, job45 8.967, job46 14.131, job47 6.196, job48 12.355, job49 5.554
job50 10.763
---- 28 SET before dependencies
             job3   job9  job13  job21  job23  job27  job32  job41  job42
job1          YES
job3                 YES
job4                        YES
job8                                            YES
job9                               YES    YES
job12                                     YES
job14                                                          YES
job21                                           YES
job26                                                                 YES
job31                                                   YES

      +     job43  job46  job48
job10                YES    YES
job11         YES
---- 28 PARAMETER due some jobs have a due date
job16 50.756, job19 57.757, job20 58.797, job25 74.443, job29 65.605, job32 55.928, job50 58.012
The solution can look like the schedule chart in the original post (not reproduced here).
This model (with this particular data set) solved in about 30 seconds (using CPLEX). Of course, it should be noted that, in general, these models can be difficult to solve to optimality.

Here is a CP Optimizer model which solves very quickly using the most recent 12.10 version (a couple of seconds).
The model is quite natural using precedence constraints and a "state function" to model the batching constraints (no two tasks from different categories can execute concurrently).
DURATION = [
11611, 12558, 11274, 7839, 5864, 6025, 11413, 10453, 5315, 12924,
5728, 6757, 10256, 12502, 6781, 5341, 10851, 11212, 8894, 8587,
7430, 7464, 6305, 14334, 8799, 12834, 8000, 6255, 12489, 5692,
7020, 5051, 7696, 9999, 6513, 6742, 8306, 8169, 8221, 14640,
14936, 8699, 8729, 12720, 8967, 14131, 6196, 12355, 5554, 10763
]
CATEGORY = [
1, 5, 3, 2, 2, 2, 2, 5, 1, 3,
5, 3, 5, 4, 1, 4, 1, 2, 4, 3,
2, 2, 1, 1, 3, 5, 2, 4, 4, 2,
1, 3, 1, 5, 2, 2, 3, 4, 4, 3,
3, 1, 2, 1, 2, 1, 4, 3, 4, 2
]
PREC = [
(0, 2), (2, 8), (3, 12), (7, 26), (8, 20), (8, 22), (11, 22),
(13, 40), (20, 26), (25, 41), (30, 31), (9, 45), (9, 47), (10, 42)
]
DEADLINE = [ (15, 50756), (18, 57757), (19, 58797),
(24, 74443), (28, 65605), (31, 55928), (49, 58012) ]
assert(len(CATEGORY) == len(DURATION))
# ===========================================================================
from docplex.cp.model import CpoModel
mdl = CpoModel()
TASKS = range(len(DURATION))
# Decision variables - interval variables with duration (length) and name
itv = [
    mdl.interval_var(length=DURATION[j], name="ITV_{}".format(j+1))
    for j in TASKS
]
# Deadlines - constrain the end of the interval.
for j, d in DEADLINE:
    mdl.add(mdl.end_of(itv[j]) <= d)
# Precedences - use end_before_start
for b, a in PREC:
    mdl.add(mdl.end_before_start(itv[b], itv[a]))
# Batching. This uses a "state function" which is an unknown function of
# time which needs to be decided by CP Optimizer. We say that this function
# must take the value of the category of the interval during the interval
# (using always_equal meaning the state function is always equal to a value
# over the extent of the interval). This means that only tasks of a particular
# category can execute at the same time.
r = mdl.state_function()
for j in TASKS:
    mdl.add(mdl.always_equal(r, itv[j], CATEGORY[j]))
# Objective. Minimize the latest task end.
makespan = mdl.max(mdl.end_of(itv[j]) for j in TASKS)
mdl.add(mdl.minimize(makespan))
# Solve it, making sure we get the absolute optimal (0 tolerance)
# and limiting the log a bit. 's' contains the solution.
s = mdl.solve(OptimalityTolerance=0, LogVerbosity="Terse")
# Get the final makespan
sol_makespan = s.get_objective_values()[0]
# Print the solution by zone
# s[X] gets the value of unknown X in the solution s
# s[r] gets the value of the state function in the solution
# this is a list of triples (start, end, value) representing
# the full extent of the state function over the whole time line.
zones = s[r]
# Iterate over the zones, ignoring the first and last ones, which
# are the zones before the first and after the last task.
for (start, end, value) in zones[1:-1]:
    print("Category is {} in window [{},{})".format(value, start, end))
    for j in TASKS:
        (istart, iend, ilength) = s[itv[j]]  # intervals are start/end/length
        if istart >= start and iend <= end:
            print("\t{} # {} -- {} --> {}".format(
                itv[j].get_name(), istart, ilength, iend))

For job scheduling I encourage you to have a look at CP Optimizer within CPLEX: introduction to CPOptimizer.
A basic jobshop model will look like:
using CP;

int nbJobs = ...;
int nbMchs = ...;

range Jobs = 0..nbJobs-1;
range Mchs = 0..nbMchs-1;

// Mchs is used both to index machines and operation position in job
tuple Operation {
  int mch; // Machine
  int pt;  // Processing time
};

Operation Ops[j in Jobs][m in Mchs] = ...;

dvar interval itvs[j in Jobs][o in Mchs] size Ops[j][o].pt;
dvar sequence mchs[m in Mchs] in
  all(j in Jobs, o in Mchs : Ops[j][o].mch == m) itvs[j][o];

minimize max(j in Jobs) endOf(itvs[j][nbMchs-1]);
subject to {
  forall (m in Mchs)
    noOverlap(mchs[m]);
  forall (j in Jobs, o in 0..nbMchs-2)
    endBeforeStart(itvs[j][o], itvs[j][o+1]);
}
as can be seen in the sched_jobshop example
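If you prefer the Python API over OPL, a rough docplex.cp translation of the same jobshop model could look like this (the PT/MCH toy arrays are made up for illustration):
from docplex.cp.model import CpoModel

# Toy data: PT[j][o] / MCH[j][o] are the processing time and machine
# of operation o of job j.
PT  = [[3, 2, 2], [2, 1, 4], [4, 3, 1]]
MCH = [[0, 1, 2], [0, 2, 1], [1, 2, 0]]
nbJobs = len(PT)
nbMchs = 3

mdl = CpoModel()
itvs = [[mdl.interval_var(size=PT[j][o], name="J{}_O{}".format(j, o))
         for o in range(nbMchs)] for j in range(nbJobs)]

# operations of a job run in order
for j in range(nbJobs):
    for o in range(nbMchs - 1):
        mdl.add(mdl.end_before_start(itvs[j][o], itvs[j][o + 1]))

# each machine processes one operation at a time
for m in range(nbMchs):
    mdl.add(mdl.no_overlap([itvs[j][o]
                            for j in range(nbJobs)
                            for o in range(nbMchs) if MCH[j][o] == m]))

# minimize the makespan (end of the last operation of each job)
mdl.add(mdl.minimize(mdl.max(mdl.end_of(itvs[j][nbMchs - 1]) for j in range(nbJobs))))

res = mdl.solve(LogVerbosity="Quiet")
print("makespan:", res.get_objective_values()[0])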

Dynamic number system in Qlik Sense

My data consists of large numbers. I have a column, say 'amount'; when using it in charts (sum of amount on the Y axis) it shows something like 1.4G. I want to show billions as, e.g., 2.8B, millions as 80M, and thousands (14,000) simply as 14k.
I have used if(sum(amount)/1000000000 > 1, Num(sum(amount)/1000000000, '#,###B'), Num(sum(amount)/1000000, '#,###M')), but it does not show the M or B at the end of the figure, and I also don't know how to include thousands in the same code.
EDIT: Updated to include the dual() function.
This worked for me:
=dual(
if(sum(amount) < 1, Num(sum(amount), '#,##0.00'),
if(sum(amount) < 1000, Num(sum(amount), '#,##0'),
if(sum(amount) < 1000000, Num(sum(amount)/1000, '#,##0k'),
if(sum(amount) < 1000000000, Num(sum(amount)/1000000, '#,##0M'),
Num(sum(amount)/1000000000, '#,##0B')
))))
, sum(amount)
)
Here are some example outputs using this script to format it:
=sum(amount)       Formatted
2,526,163,764      3B
79,342,364         79M
5,589,255          5M
947,470            947k
583                583
0.6434             0.64
To get more decimals for any of those, like 2.53B instead of 3B, you can format them like '#,##0.00B' by adding more zeroes at the end.
Also make sure that the Number Formatting property is set to Auto or Measure expression.
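If it helps to see the tier logic outside of Qlik syntax, here is the same threshold cascade sketched in Python (illustration only; the function name is made up):
def abbreviate(amount):
    # same tiers as the dual() expression above
    if amount < 1:
        return "{:,.2f}".format(amount)
    if amount < 1000:
        return "{:,.0f}".format(amount)
    if amount < 1000000:
        return "{:,.0f}k".format(amount / 1000)
    if amount < 1000000000:
        return "{:,.0f}M".format(amount / 1000000)
    return "{:,.0f}B".format(amount / 1000000000)

print(abbreviate(79342364))  # 79M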

Teacher-Student System: Training Student with Top-k Hypotheses List

I want to configure a teacher-student system, where a teacher seq2seq model generates a top-k list of hypotheses, which are used to train a student seq2seq model.
My plan for implementing this is to batch the teacher hypotheses, meaning that the teacher outputs a tensor with a batch axis of length k * B, where B is the input batch axis length. The output batch tensor then contains k hypotheses for each sequence in the input batch tensor, sorted by the position of the associated input sequence in the input batch.
This tensor is set as the student’s training target. However, the student’s batch tensor still has a batch axis of length B, so I use tf.repeat to repeat the sequences in the output tensor of the student’s encoder k times, before feeding that tensor into the student’s decoder.
For debugging purposes, I made the simplification of repeating the teacher’s single best hypothesis for now, before implementing the top-k list selection.
Here is a summary of my config file:
[...]
# Variables:
student_target = "teacher_hypotheses_stack"
[...]
# Custom repeat function:
def repeat(source, src_name="source", **kwargs):
    import tensorflow as tf
    input = source(0)
    input = tf.Print(input, [src_name, "in", input, tf.shape(input)])
    output = tf.repeat(input, repeats=3, axis=1)
    output = tf.Print(output, [src_name, "out", output, tf.shape(output)])
    return output

def repeat_t(source, **kwargs):
    return repeat(source, "teacher")

def repeat_s(source, **kwargs):
    return repeat(source, "student")
[...]
# Configuration of the teacher + repeating of its output
**teacher_network(),  # The teacher_network is an encoder-decoder seq2seq model.
                      # The teacher performs search during training and is untrainable.
"teacher_stack": {
    "class": "eval", "from": ["teacher_decision"], "eval": repeat_t,
    "trainable": False
    # "register_as_extern_data": "teacher_hypotheses_stack"
},
"teacher_stack_reinterpreter": {  # An attempt to explicitly (re-)select the batch axis. Probably unnecessary...
    "class": "reinterpret_data",
    "set_axes": {"B": 1, "T": 0},
    "enforce_time_major": True,
    "from": ["teacher_stack"],
    "trainable": False,
    "register_as_extern_data": "teacher_hypotheses_stack"
}
[...]
# Repeating of the student's encoder output + configuration of its decoder
"student_encoder": {"class": "copy", "from": ["student_lstm6_fw", "student_lstm6_bw"]},  # dim: EncValueTotalDim
"student_encoder_repeater": {"class": "eval", "from": ["student_encoder"], "eval": repeat},
"student_encoder_stack": {  # An attempt to explicitly (re-)select the batch axis. Probably unnecessary...
    "class": "reinterpret_data",
    "set_axes": {"B": 1, "T": 0},
    "enforce_time_major": True,
    "from": ["student_encoder_repeater"]
},
"student_enc_ctx": {"class": "linear", "activation": None, "with_bias": True, "from": ["student_encoder_stack"], "n_out": EncKeyTotalDim},  # preprocessed_attended in Blocks
"student_inv_fertility": {"class": "linear", "activation": "sigmoid", "with_bias": False, "from": ["student_encoder_stack"], "n_out": AttNumHeads},
"student_enc_value": {"class": "split_dims", "axis": "F", "dims": (AttNumHeads, EncValuePerHeadDim), "from": ["student_encoder_stack"]},  # (B, enc-T, H, D'/H)
"model1_output": {"class": "rec", "from": [], 'cheating': config.bool("cheating", False), "unit": {
'output': {'class': 'choice', 'target': student_target, 'beam_size': beam_size, 'cheating': config.bool("cheating", False), 'from': ["model1_output_prob"], "initial_output": 0},
"end": {"class": "compare", "from": ["output"], "value": 0},
'model1_target_embed': {'class': 'linear', 'activation': None, "with_bias": False, 'from': ['output'], "n_out": target_embed_size, "initial_output": 0}, # feedback_input
"model1_weight_feedback": {"class": "linear", "activation": None, "with_bias": False, "from": ["prev:model1_accum_att_weights"], "n_out": EncKeyTotalDim, "dropout": 0.3},
"model1_s_transformed": {"class": "linear", "activation": None, "with_bias": False, "from": ["model1_s"], "n_out": EncKeyTotalDim, "dropout": 0.3},
"model1_energy_in": {"class": "combine", "kind": "add", "from": ["base:student_enc_ctx", "model1_weight_feedback", "model1_s_transformed"], "n_out": EncKeyTotalDim},
"model1_energy_tanh": {"class": "activation", "activation": "tanh", "from": ["model1_energy_in"]},
"model1_energy": {"class": "linear", "activation": None, "with_bias": False, "from": ["model1_energy_tanh"], "n_out": AttNumHeads}, # (B, enc-T, H)
"model1_att_weights": {"class": "softmax_over_spatial", "from": ["model1_energy"]}, # (B, enc-T, H)
"model1_accum_att_weights": {"class": "eval", "from": ["prev:model1_accum_att_weights", "model1_att_weights", "base:student_inv_fertility"],
"eval": "source(0) + source(1) * source(2) * 0.5", "out_type": {"dim": AttNumHeads, "shape": (None, AttNumHeads)}},
"model1_att0": {"class": "generic_attention", "weights": "model1_att_weights", "base": "base:student_enc_value"}, # (B, H, V)
"model1_att": {"class": "merge_dims", "axes": "except_batch", "from": ["model1_att0"]}, # (B, H*V)
"model1_s": {"class": "rnn_cell", "unit": "LSTMBlock", "from": ["prev:model1_target_embed", "prev:model1_att"], "n_out": 1000, "dropout": 0.3}, # transform
"model1_readout_in": {"class": "linear", "from": ["model1_s", "prev:model1_target_embed", "model1_att"], "activation": None, "n_out": 1000, "dropout": 0.3}, # merge + post_merge bias
"model1_readout": {"class": "reduce_out", "mode": "max", "num_pieces": 2, "from": ["model1_readout_in"]},
"model1_output_prob": {
"class": "softmax", "from": ["model1_readout"], "dropout": 0.3,
"target": student_target,
"loss": "ce", "loss_opts": {"label_smoothing": 0.1}
}
}, "target": student_target},
[...]
Running this config will print the following error message to the console:
[...]
Create Adam optimizer.
Initialize optimizer (default) with slots ['m', 'v'].
These additional variable were created by the optimizer: [<tf.Variable 'optimize/beta1_power:0' shape=() dtype=float32_ref>, <tf.Variable 'optimize/beta2_power:0' shape=() dtype=float32_ref>].
[teacher][in][[6656 6657 6658...]...][17 23]
[teacher][out][[6656 6656 6656...]...][17 69]
TensorFlow exception: assertion failed: [x.shape[0] != y.shape[0]] [69 17] [23]
[[node objective/loss/error/sparse_labels/check_dim_equal/assert_equal_1/Assert/Assert (defined at home/philipp/Documents/bachelor-thesis/returnn/returnn-venv/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[...]
Execute again to debug the op inputs...
FetchHelper(0): <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/Shape_1_1:0' shape=(1,) dtype=int32> = shape (1,), dtype int32, min/max 23/23, ([23])
FetchHelper(0): <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/assert_equal_1/Assert/Assert/data_0_1:0' shape=() dtype=string> = bytes(b'x.shape[0] != y.shape[0]')
FetchHelper(0): <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/Shape_2:0' shape=(2,) dtype=int32> = shape (2,), dtype int32, min/max 17/69, ([69 17])
FetchHelper(0): <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/assert_equal_1/All_1:0' shape=() dtype=bool> = bool_(False)
[teacher][in][[6656 6657 6658...]...][17 23]
[teacher][out][[6656 6656 6656...]...][17 69]
Op inputs:
<tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/assert_equal_1/All:0' shape=() dtype=bool>: bool_(False)
<tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/assert_equal_1/Assert/Assert/data_0:0' shape=() dtype=string>: bytes(b'x.shape[0] != y.shape[0]')
<tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/Shape:0' shape=(2,) dtype=int32>: shape (2,), dtype int32, min/max 17/69, ([69 17])
<tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/Shape_1:0' shape=(1,) dtype=int32>: shape (1,), dtype int32, min/max 23/23, ([23])
Step meta information:
{'seq_idx': [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22],
'seq_tag': ['seq-0','seq-1','seq-2','seq-3','seq-4','seq-5','seq-6','seq-7','seq-8','seq-9','seq-10','seq-11','seq-12','seq-13','seq-14','seq-15','seq-16','seq-17','seq-18','seq-19','seq-20','seq-21','seq-22']}
Feed dict:
<tf.Tensor 'extern_data/placeholders/data/data:0' shape=(?, ?, 80) dtype=float32>: shape (23, 42, 80), dtype float32, min/max -0.5/0.4, mean/stddev -0.050000004/0.28722814, Data(name='data', shape=(None, 80), batch_shape_meta=[B,T|'time:var:extern_data:data',F|80])
<tf.Tensor 'extern_data/placeholders/data/data_dim0_size:0' shape=(?,) dtype=int32>: shape (23,), dtype int32, min/max 42/42, ([42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42])
<tf.Tensor 'extern_data/placeholders/source_text/source_text:0' shape=(?, ?, 512) dtype=float32>: shape (23, 13, 512), dtype float32, min/max -0.5/0.4, mean/stddev -0.050011758/0.28722063, Data(name='source_text', shape=(None, 512), available_for_inference=False, batch_shape_meta=[B,T|'time:var:extern_data:source_text',F|512])
<tf.Tensor 'extern_data/placeholders/source_text/source_text_dim0_size:0' shape=(?,) dtype=int32>: shape (23,), dtype int32, min/max 13/13, ([13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13])
<tf.Tensor 'extern_data/placeholders/target_text/target_text:0' shape=(?, ?) dtype=int32>: shape (23, 17), dtype int32, min/max 6656/6694, Data(name='target_text', shape=(None,), dtype='int32', sparse=True, dim=35209, available_for_inference=False, batch_shape_meta=[B,T|'time:var:extern_data:target_text'])
<tf.Tensor 'extern_data/placeholders/target_text/target_text_dim0_size:0' shape=(?,) dtype=int32>: shape (23,), dtype int32, min/max 17/17, ([17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17])
<tf.Tensor 'globals/train_flag:0' shape=() dtype=bool>: bool(True)
EXCEPTION
[...]
File "home/philipp/Documents/bachelor-thesis/returnn/repository/TFUtil.py", line 4374, in sparse_labels_with_seq_lens
x = check_dim_equal(x, 0, seq_lens, 0)
[...]
So, the network is built without errors, but on the first training step, it crashes with an assertion error. To me it looks like RETURNN or TensorFlow validates the batch length against its original value somehow. But I don’t know where or why, so I have no clue what to do about this.
What am I doing wrong? Is my idea even implementable with RETURNN this way?
EDIT (10th June 2020): For clarification: My ultimate goal is to let the teacher generate a top-k list of hypotheses for each input sequence, which are then used to train the student. So, for each input sequence of the student, there are k solutions/target sequences.
To train the student, it must predict the probability of each hypothesis, and the cross-entropy loss is then calculated to determine the update gradients. But if there are k target sequences for each input sequence, the student must decode the encoder states k times, each time targeting a different target sequence.
This is why I want to repeat the encoder states k times, to make the student decoder’s data parallel and then use the default cross-entropy loss implementation of RETURNN:
input-seq-1 --- teacher-hyp-1-1;
input-seq-1 --- teacher-hyp-1-2;
...;
input-seq-1 --- teacher-hyp-1-k;
input-seq-2 --- teacher-hyp-2-1;
...
Is there a more proper way to achieve my goal?
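To illustrate the intended alignment (a standalone sketch, not RETURNN config code; it is batch-major with axis=0 here, whereas my config works time-major):
import tensorflow as tf

B, T, D, k = 2, 5, 4, 3
enc = tf.random.normal([B, T, D])          # student encoder output (batch-major)
targets = tf.zeros([k * B, 7], tf.int32)   # k stacked teacher hypotheses per input

# repeat each batch entry k times: [seq1, seq1, seq1, seq2, seq2, seq2]
enc_rep = tf.repeat(enc, repeats=k, axis=0)
print(enc_rep.shape)  # (6, 5, 4) -- batch axis now matches the targets' k*B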
EDIT (12th June 2020 #1): Yes, I know that the DecisionLayer of the teacher already selects the best hypothesis, and that this way I’m only repeating that best hypothesis k times. I’m doing this as an intermediate step towards my ultimate goal. Later, I want to fetch the top-k list from the teacher’s ChoiceLayer somehow, but that felt like a separate issue.
But Albert, you say RETURNN would extend the data on the batch dimension automatically somehow? How can I picture that?
EDIT (12th June 2020 #2): Okay, now I select the top-k (this time k=4) hypotheses list from the teacher’s choice layer (or output layer) by:
"teacher_hypotheses": {
"class": "copy", "from": ["extra.search:teacherMT_output"],
"register_as_extern_data": "teacher_hypotheses_stack"
}
But using this Data as training target of the student leads to the error:
TensorFlow exception: assertion failed: [shape[0]:] [92] [!=] [dim:] [23]
[[node studentMT_output/rec/subnet_base/check_seq_len_batch_size/check_input_dim/assert_equal_1/Assert/Assert (defined at home/philipp/Documents/bachelor-thesis/returnn/returnn-venv/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
This is, I assume, because the target data of the student, the hypotheses list, has a batch axis k=4 times longer than that of the student’s input data/encoder state data.
Doesn’t the student encoder state data need to be extended/repeated here, to match the target data?
EDIT (12th June 2020 #3): I consider the initial issue solved. The overall issue is continued here: Teacher-Student System: Training Student With k Target Sequences for Each Input Sequence
It does not only validate the batch length. It will collapse the batch and time dimensions (it uses flatten_with_seq_len_mask; see the code of Loss.init and that function) and then calculate the loss on that flattened tensor. So the seq lengths also need to match. This might be a problem, but I'm not sure. As you have the same target also for the rec layer itself, it should have the same seq length in training.
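Conceptually, that flattening does something like the following (a sketch of the idea, not RETURNN's actual implementation), which is why the seq lengths of output and target must agree:
import tensorflow as tf

def flatten_with_mask(x, seq_lens):
    # x: [B, T, D], seq_lens: [B]
    # keep only the valid (unpadded) frames of every sequence and
    # concatenate them into one tensor of shape [sum(seq_lens), D]
    mask = tf.sequence_mask(seq_lens, maxlen=tf.shape(x)[1])  # [B, T] booleans
    return tf.boolean_mask(x, mask)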
You can debug this by carefully checking the output of debug_print_layer_output_template, i.e. checking the Data (batch-shape-meta) output, to see whether the axes are all as you expect them to be.
(debug_print_layer_output_template can and should always be enabled. It will not make anything slower.)
You can also temporarily enable debug_print_layer_output_shape, which will really print the shape of all tensors. That way you can verify what the shapes look like.
Your usage of ReinterpretDataLayer looks very wrong. You should never ever explicitly set the axes by integer (like "set_axes": {"B": 1, "T": 0}). Why are you doing this at all? This could be the reason why it is messed up in the end.
Your repeat function is not very generic either. You are using hard-coded axis integers there as well. You should never do that. Instead, you would write something like:
input_data = source(0, as_data=True)
input = input_data.placeholder
...
output = tf.repeat(input, repeats=3, axis=input_data.batch_dim_axis)
Did I understand correctly that this is what you want to do: repeat in the batch axis?
In that case, you also need to adapt the seq length information of the output of that layer. You cannot simply use that function as-is in an EvalLayer. You would also need to define out_type as a function which returns the correct Data template. E.g. like this:
def repeat_out(out):
    out = out.copy()
    out.size_placeholder[0] = tf.repeat(out.size_placeholder[0], axis=0, repeats=3)
    return out
...
"student_encoder_repeater": {
    "class": "eval", "from": ["student_encoder"], "eval": repeat,
    "out_type": lambda sources, **kwargs: repeat_out(sources[0].output)
}
Now you have the additional problem that every time you call this repeat_out, you will get another seq length info. RETURNN will not be able to tell whether these seq lengths are all the same or different (at compile time). And that will cause errors or strange effects. To solve this, you should reuse the same seq length. E.g. like this:
"teacher_stack_": {
"class": "eval", "from": "teacher_decision", "eval": repeat
},
"teacher_stack": {
"class": "reinterpret_data", "from": "teacher_stack_", "size_base": "student_encoder_repeater"
}
Btw, why do you want to do this repetition at all? What's the idea behind it? You repeat both the student and the teacher 3 times? So wouldn't just increasing your learning rate by a factor of 3 do the same?
Edit: It seems as if this is done to match the top-k list. In that case, this is all wrong, as RETURNN should already automatically do such repetition. You should not do this manually.
Edit: To understand how the repetition (and also beam search resolving in general) works, the first thing to do is look at the log output (you must have debug_print_layer_output_template enabled, but you should have that anyway, all the time). You will see the output of each layer, especially its Data output object. This is already useful to check whether the shapes are all as you expect (check batch_shape_meta in the log). However, this is only the static shape at compile time, so batch-dim is just a marker there. You will also see the search beam information. This keeps track of whether the batch originates from some beam search (any ChoiceLayer basically), and has a beam, and the beam size. Now, in the code, check SearchChoices.translate_to_common_search_beam, and its usages. When you follow the code, you will see SelectSearchSourcesLayer, and effectively your case will end up with output.copy_extend_with_beam(search_choices.get_beam_info()).
Edit: To repeat, this is done automatically. You do not need to call copy_extend_with_beam manually.
If you expect to get the top-k list from the teacher, you are also likely doing it wrong, as I see that you used "teacher_decision" as input. I guess this is coming from a DecisionLayer? In that case, it already took only the first-best from the top-k beam.
Edit: Now I understand that you are ignoring this, and instead want to take only the first best, and then also repeat this. I would recommend not doing that, as you are making it unnecessarily complicated, and you are kind of fighting RETURNN, which knows what the batch-dim should be and will get confused. (You can make it work by what I wrote above, but really, this is just unnecessarily complicated.)
Btw, there is no point in setting an EvalLayer to "trainable": False. That has no effect. The eval layer has no parameters anyway.

Prometheus Histogram Vector: All buckets fill equally?

I intend to use a Prometheus Histogram vector to monitor the execution time of request handlers in Go.
I register it so:
var RequestTimeHistogramVec = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "request_duration_seconds",
        Help:    "Request duration distribution",
        Buckets: []float64{0.125, 0.25, 0.5, 1, 1.5, 2, 3, 4, 5, 7.5, 10, 20},
    },
    []string{"endpoint"},
)

func init() {
    prometheus.MustRegister(RequestTimeHistogramVec)
}
I use it so:
startTime := time.Now()
// handle request here
metrics.RequestTimeHistogramVec.WithLabelValues("get:" + endpointName).Observe(time.Since(startTime).Seconds())
When I do an HTTP GET to the /metrics endpoint after using my endpoint a couple of times, I get, amongst other things, the following:
# HELP request_duration_seconds Request duration distribution
# TYPE request_duration_seconds histogram
request_duration_seconds_bucket{endpoint="get:/position",le="0.125"} 6
request_duration_seconds_bucket{endpoint="get:/position",le="0.25"} 6
request_duration_seconds_bucket{endpoint="get:/position",le="0.5"} 6
request_duration_seconds_bucket{endpoint="get:/position",le="1"} 6
request_duration_seconds_bucket{endpoint="get:/position",le="1.5"} 6
request_duration_seconds_bucket{endpoint="get:/position",le="2"} 6
request_duration_seconds_bucket{endpoint="get:/position",le="3"} 6
request_duration_seconds_bucket{endpoint="get:/position",le="4"} 6
request_duration_seconds_bucket{endpoint="get:/position",le="5"} 6
request_duration_seconds_bucket{endpoint="get:/position",le="7.5"} 6
request_duration_seconds_bucket{endpoint="get:/position",le="10"} 6
request_duration_seconds_bucket{endpoint="get:/position",le="20"} 6
request_duration_seconds_bucket{endpoint="get:/position",le="+Inf"} 6
request_duration_seconds_sum{endpoint="get:/position"} 0.022002387
request_duration_seconds_count{endpoint="get:/position"} 6
From the looks of it, all buckets are filled by the same amount, equal to the total number of times I used my endpoint (6 times).
Why does this happen and how may I fix it?
Prometheus histogram buckets are cumulative, so in this case all the requests took less than or equal to 125ms.
In this case your choice of buckets may not be the best; you might want to make some of the buckets smaller.
This is not an error. Notice the rule for filling the bucket is le=..., meaning less or equal. Since all 6 requests succeeded quickly, all buckets were filled.
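A small sketch of the cumulative semantics (plain Python illustration; the six observation values are made up to roughly match the reported sum of ~0.022s):
bounds = [0.125, 0.25, 0.5, 1, 1.5, 2, 3, 4, 5, 7.5, 10, 20, float("inf")]
observations = [0.003, 0.004, 0.002, 0.005, 0.004, 0.004]  # six fast requests

# each observation lands in EVERY bucket whose upper bound it is <= to
counts = {le: sum(1 for v in observations if v <= le) for le in bounds}
print(counts)  # every bucket reports 6, since all observations are <= 0.125
To see the actual distribution, look at the differences between consecutive buckets (or let PromQL's histogram_quantile do that for you).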

Algorithm to calculate a date for complex occupation management

Hello fellow Stack Overflowers,
I have a situation where I need some help choosing the best way to design an algorithm. The objective is to manage the occupation of a resource (let's consider resource A) that has multiple tasks, where each task takes a specified amount of time to complete. At this first stage I don't want to involve multiple variables, so let's keep it simple and consider only his schedule of working days.
For example:
1 - We have 1 resource, resource A.
2 - Resource A works from 8 am to 4 pm, Monday to Friday. To keep it simple for now, he doesn't have lunch, so: 8 hours of work a day.
3 - Resource A has 5 tasks to complete. To avoid complexity at this level, let's suppose each one will take exactly 10 hours to complete.
4 - Resource A will start working on these tasks at 2018-05-16, exactly at 2 pm.
Problem:
Now, all I need to know is the correct finish date for all the 5 tasks, but considering all the previous limitations.
In this case, he needs 6 working days and additionally 2 hours of a 7th day.
The expected result that I want would be: 2018-05-24 (at 4 pm).
Implementation:
I thought about 2 options, and would like feedback on these options, or other options that I might not be considering.
Algorithm 1
1 - Create a list of "slots", where each slot represents 1 hour, for x days.
2 - Cross this list of slots with the resource's hour schedule, removing all the slots where the resource isn't available. This returns a list of the slots in which he can actually work.
3 - Occupy the remaining slots with the tasks I have for him.
4 - Finally, check the date/hour of the last occupied slot.
Disadvantage: I think this might be an overkill solution, considering that I don't need to track his occupation into the future; all I want to know is when the tasks will be completed.
Algorithm 2
1 - Add the task hours (50 hours) to the starting date, getting expectedFinishDate. (This would give expectedFinishDate = 2018-05-18 at 4 pm.)
2 - Cross the hours between the starting date and expectedFinishDate with the schedule, to get the number of hours he won't work. (This basically yields the unavailable hours, 16 hours a day, resulting in remainingHoursForCalc = 32 hours.)
3 - Calculate the new expectedFinishDate by adding these 32 unavailable hours to the previous 2018-05-18 at 4 pm.
4 - Repeat steps 2 and 3 with the new expectedFinishDate until remainingHoursForCalc = 0.
What would you suggest? Is there any other option that I might not be considering that would make this simpler? Or do you think there is a way to improve either of these 2 algorithms to make them work?
Improved version:
import java.util.Calendar;
import java.util.Date;

public class Main {

    public static void main(String args[]) throws Exception {
        Date d = new Date();
        System.out.println(d);
        d.setMinutes(0);
        d.setSeconds(0);
        d.setHours(13);
        Calendar c = Calendar.getInstance();
        c.setTime(d);
        c.set(Calendar.YEAR, 2018);
        c.set(Calendar.MONTH, Calendar.MAY);
        c.set(Calendar.DAY_OF_MONTH, 17);
        //c.add(Calendar.HOUR, -24-5);
        d = c.getTime();
        //int workHours=11;
        int hoursArray[] = {1, 2, 3, 4, 5, 10, 11, 12, 19, 20, 40};
        for (int workHours : hoursArray) {
            try {
                Date end = getEndOfTask(d, workHours);
                System.out.println("a task starting at " + d + " and lasting " + workHours
                        + " hours will end at " + end);
            } catch (Exception e) {
                System.out.println(e.getMessage());
            }
        }
    }

    public static Date getEndOfTask(Date startOfTask, int workingHours) throws Exception {
        int totalHours = 0; // including non-working hours
        // startOfTask + totalHours = endOfTask
        int startHour = startOfTask.getHours();
        if (startHour < 8 || startHour > 16)
            throw new Exception("a task cannot start outside the working hours interval");
        System.out.println("startHour=" + startHour);
        int startDayOfWeek = startOfTask.getDay(); // start date's day of week; Wednesday=3
        System.out.println("startDayOfWeek=" + startDayOfWeek);
        if (startDayOfWeek == 6 || startDayOfWeek == 0)
            throw new Exception("a task cannot start on Saturdays or Sundays");
        int remainingHoursUntilDayEnd = 16 - startHour;
        System.out.println("remainingHoursUntilDayEnd=" + remainingHoursUntilDayEnd);
        /* some discussion here: if a task starts at 12:30, we have 3h30min
         * until the end of the program; however, getHours() will return 12, which
         * subtracted from 16 gives 4h. It will work fine if the task starts at 12:00,
         * or, generally, at the beginning of the hour; let's assume a task will start at HH:00 */
        int remainingDaysUntilWeekEnd = 5 - startDayOfWeek;
        System.out.println("remainingDaysUntilWeekEnd=" + remainingDaysUntilWeekEnd);
        int completeWorkDays = (workingHours - remainingHoursUntilDayEnd) / 8;
        System.out.println("completeWorkDays=" + completeWorkDays);
        // excluding both the start day and the end day, if they are not fully occupied by the task
        int workingHoursLastDay = (workingHours - remainingHoursUntilDayEnd) % 8;
        System.out.println("workingHoursLastDay=" + workingHoursLastDay);
        /* workingHours = remainingHoursUntilDayEnd + (8 * completeWorkDays) + workingHoursLastDay */
        int numberOfWeekends = (int) Math.ceil((completeWorkDays - remainingDaysUntilWeekEnd) / 5.0);
        if ((completeWorkDays - remainingDaysUntilWeekEnd) % 5 == 0) {
            if (workingHoursLastDay > 0) {
                numberOfWeekends++;
            }
        }
        System.out.println("numberOfWeekends=" + numberOfWeekends);
        totalHours += (int) Math.min(remainingHoursUntilDayEnd, workingHours); // covers the case
        // when the task lasts 1 or 2 hours and we have maybe 4h until end of day; that's why I use Math.min
        if (completeWorkDays > 0 || workingHoursLastDay > 0) {
            totalHours += 8; // the hours of the current day between 16:00 and 24:00
            // it might be the case that completeWorkDays is 0, yet the task spans up to tomorrow,
            // so we still have to add these 8h
        }
        if (completeWorkDays > 0) { // redundant if, because 24*0=0
            totalHours += 24 * completeWorkDays; // for every 8 working hours, a total of 24h has
            // to be added to the date
        }
        if (workingHoursLastDay > 0) {
            totalHours += 8; // the hours between 00:00 and 8 AM
            totalHours += workingHoursLastDay;
        }
        if (numberOfWeekends > 0) {
            totalHours += 48 * numberOfWeekends; // every weekend between start and end dates means two days
        }
        System.out.println("totalHours=" + totalHours);
        Calendar calendar = Calendar.getInstance();
        calendar.setTime(startOfTask);
        calendar.add(Calendar.HOUR, totalHours);
        return calendar.getTime();
    }
}
You may adjust hoursArray[], or d.setHours along with c.set(Calendar.DAY_OF_MONTH, ...), to test various start dates with various task lengths.
There is still a bug, due to the addition of the 8 hours between 16:00 and 24:00:
a task starting at Thu May 17 13:00:00 EEST 2018 and lasting 11 hours will end at Sat May 19 00:00:00 EEST 2018.
I've kept a lot of print statements; they are useful for debugging purposes.
I agree that algorithm 1 is overkill.
I think I would first make sure I had the conditions right: hours per day (8), working days (Mon, Tue, Wed, Thu, Fri). I would then divide the hours required (5 * 10 = 50) by the hours per day, so I know the minimum number of working days needed (50 / 8 = 6, remainder 2). Slightly more advanced: divide by hours per week first (50 / 40 = 1 week). Count working days from the start date to get a first shot at the end date. There was probably a remainder from the division, so use it to determine whether the tasks can end on that day or run into the next working day.
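A minimal sketch of that counting approach (assuming, as above, that tasks start on the hour and the window is 8:00 to 16:00, Monday to Friday):
from datetime import datetime, timedelta

WORK_START, WORK_END = 8, 16   # 8 am to 4 pm
WORKDAYS = range(0, 5)         # Monday=0 .. Friday=4

def finish_date(start, task_hours):
    # walk day by day, consuming the working hours available on each day
    remaining = task_hours
    current = start
    while True:
        if current.weekday() in WORKDAYS:
            available = WORK_END - max(current.hour, WORK_START)
            if 0 < available and remaining <= available:
                return current.replace(hour=max(current.hour, WORK_START) + remaining)
            remaining -= max(available, 0)
        # jump to the start of the next morning
        current = (current + timedelta(days=1)).replace(
            hour=WORK_START, minute=0, second=0, microsecond=0)

print(finish_date(datetime(2018, 5, 16, 14), 50))  # -> 2018-05-24 16:00:00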

How are YARN cluster metrics calculated? Are they an instant snapshot or an average over a period?

For example, by executing this:
http://<resource-manager-host>:8088/ws/v1/cluster/metrics
I get an output like this:
{
  "clusterMetrics": {
    "appsSubmitted": 502521,
    "appsCompleted": 501201,
    "appsPending": 0,
    "appsRunning": 19,
    "appsFailed": 454,
    "appsKilled": 847,
    "reservedMB": 140400,
    "availableMB": 12615232,
    "allocatedMB": 8830800,
    "reservedVirtualCores": 39,
    "availableVirtualCores": 6140,
    "allocatedVirtualCores": 2065,
    "containersAllocated": 1692,
    "containersReserved": 39,
    "containersPending": 3960,
    "totalMB": 21446032,
    "totalVirtualCores": 8205,
    "totalNodes": 199,
    "lostNodes": 1,
    "unhealthyNodes": 1,
    "decommissionedNodes": 8,
    "rebootedNodes": 0,
    "activeNodes": 189
  }
}
For instance, what does allocatedMB mean?
Is it an instantaneous value?
Is it averaged over an interval? Is the interval configurable?
allocatedMB is the memory that has been assigned to the vcores (though not necessarily used). Yes, it is an instantaneous value. There is no interval; it's a snapshot of your cluster at that instant (minus the time it takes to compute these values from the data structures in the ResourceManager and then return them via the REST API).
If you want to translate your metrics, it's saying:
You currently have 19 apps running.
These 19 apps are using a total of 2065 vcores.
These 2065 vcores have 8830800 MB of memory allocated to them.
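As a sanity check that this is a single consistent snapshot: allocatedMB + availableMB = 8830800 + 12615232 = 21446032, which is exactly the totalMB reported in the same response.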
