How to set work group count appropriately in Vulkan

In OpenCL there is a parameter called CL_DEVICE_MAX_COMPUTE_UNITS that can be queried by calling clGetDeviceInfo; it indicates the number of parallel compute units on the device, and a single work-group executes on a single compute unit.
However, there doesn't seem to be a way to query an equivalent parameter in Vulkan.
Am I missing something and it can actually be queried? Or do we usually just pick an arbitrary default value (such as 256) when the input size is indeterminate?

Vulkan has no way to ask that question. And that's probably for the best.
First, the concept of a "compute unit" was never well defined, even in OpenCL, so exactly what this value means is unclear.
Second, if the question you really want to ask is "how many work groups can execute in parallel at any one time", then the answer may be shader-dependent. For example, if a piece of hardware can execute 32 work items on a single compute unit, it may be able to populate those 32 work items from distinct work groups. That is, your notion that "a single work-group executes on a single compute unit" is not necessarily true.
If a shader's work group size is 16, there's little to be lost by running two such work groups on the same compute unit at the same time. Sure, different barrier usage may cause them to get split up, but it may not. It's probably better to take the chance that it will work than to assume it won't.
And third: what exactly do you intend to do with that information? If you have X work groups to execute, issuing multiple dispatch commands in batches of CL_DEVICE_MAX_COMPUTE_UNITS isn't going to make the process go faster. And trying to interleave work groups from different compute tasks is going to be slower, due to having to reset pipelines and other state. It's better to throw the whole workload at the GPU and let its scheduler sort out how to apply the work items to the compute units.

Related

OpenMDAO - information on cycles

In OpenMDAO, is there any way to get the analytics about the execution of the nonlinear solvers within a coupled model (containing multiple cycles and subcycles), such as the number of iterations within each of the cycles, and execution time?
Though there is no specific functionality to get this exact data, you should be able to get the information you need from the case recorder data, which includes iteration counts and timestamps. You'd have to do a bit of analysis on the first/last case of a specific run of a solver to compute the run times; iteration counts should be very straightforward.
This question seems closely related to another recently posted one, which did identify a bug in OpenMDAO (Issue #2453). Until that bug is fixed, you'll need to use the case names to separate out which cases belong to which cycles, since you can currently only add recorders to the components/groups and not to the nested solvers themselves. But the naming of the cases should still allow you to pull out the data you need.
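For illustration, here is a rough Python sketch of pulling iteration counts and run times out of the recorded cases; it assumes a recorder was attached to the group that owns the cycle and wrote to "cases.sql", and "root.cycle1" is a placeholder for your own group's pathname:

import openmdao.api as om

cr = om.CaseReader("cases.sql")

# All cases recorded for the group that contains the cycle.
case_names = cr.list_cases("root.cycle1")
cases = [cr.get_case(name) for name in case_names]

# Iteration count is just the number of cases recorded for that source;
# wall time comes from the first and last timestamps.
n_iters = len(cases)
elapsed = cases[-1].timestamp - cases[0].timestamp
print(f"{n_iters} iterations, {elapsed:.3f} s")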

Determinism in tensorflow gradient updates?

So I have a very simple NN script written in Tensorflow, and I am having a hard time trying to trace down where some "randomness" is coming in from.
I have recorded the weights, gradients, and logits of my network as I train, and for the first iteration it is clear that everything starts off the same. I have a SEED value both for how data is read in, and a SEED value for initializing the weights of the net. Those I never change.
My problem is that, by say the second iteration of every re-run I do, I start to see the gradients diverge (by a small amount, like 1e-6 or so). Over time, this of course leads to non-repeatable behaviour.
What might the cause of this be? I don't know where any possible source of randomness might be coming from...
Thanks
There's a good chance you could get deterministic results if you run your network on the CPU (export CUDA_VISIBLE_DEVICES=), with a single thread in the Eigen thread pool (tf.Session(config=tf.ConfigProto(intra_op_parallelism_threads=1))), one Python thread (no multi-threaded queue runners, which you get from ops like tf.train.batch), and a single well-defined operation order. Setting inter_op_parallelism_threads=1 may also help in some scenarios.
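As a minimal sketch, those settings look like this (TF 1.x API):

import os
# Hide all GPUs so everything runs on the CPU (same effect as
# `export CUDA_VISIBLE_DEVICES=` in the shell).
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import tensorflow as tf

# Single thread in the Eigen pool and a single inter-op thread, which removes
# one common source of non-deterministic reduction order.
config = tf.ConfigProto(intra_op_parallelism_threads=1,
                        inter_op_parallelism_threads=1)

with tf.Session(config=config) as sess:
    # ... build the graph and run the training loop single-threaded here ...
    pass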
One issue is that floating point addition/multiplication is non-associative, so one fool-proof way to get deterministic results is to use integer arithmetic or quantized values.
Barring that, you could isolate which operation is non-deterministic and try to avoid using that op. For instance, the tf.add_n op says nothing about the order in which it sums its values, and different orders can produce different results.
Getting deterministic results is a bit of an uphill battle because determinism is in conflict with performance, and performance is usually the goal that gets more attention. An alternative to trying to get exactly the same numbers on reruns is to focus on numerical stability: if your algorithm is stable, you will get reproducible results (i.e., the same number of misclassifications) even though the exact parameter values may be slightly different.
The tensorflow reduce_sum op is specifically known to be non-deterministic. Furthermore, reduce_sum is used for calculating bias gradients.
This post discusses a workaround to avoid using reduce_sum (i.e., taking the dot product of any vector with a vector of all 1's is the same as reduce_sum).
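A minimal sketch of that trick (TF 1.x API; the shapes are made up for illustration):

import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[32, 128])   # e.g. a batch of per-example gradients

# Sums along the batch dimension; on the GPU the summation order is unspecified.
s_reduce = tf.reduce_sum(x, axis=0)               # shape [128]

# The same sum written as a dot product with a vector of ones, so the matmul
# kernel is used instead of the reduce_sum kernel.
ones = tf.ones([1, 32])
s_matmul = tf.reshape(tf.matmul(ones, x), [128])  # shape [128]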
I have faced the same problem.
The working solution for me was to:
1. Use tf.set_random_seed(1) so that all tf functions get the same seed on every new run.
2. Train the model on the CPU rather than the GPU, to avoid the GPU's non-deterministic operations due to precision.
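For example (seeding NumPy and Python's random module is only needed if you use them for shuffling or augmentation):

import random
import numpy as np
import tensorflow as tf

tf.set_random_seed(1)   # graph-level seed, so tf ops get the same seeds each run
np.random.seed(1)       # only if NumPy is used in the input pipeline
random.seed(1)          # likewise for the Python random module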

Design tips for synchronising signals through a VHDL pipeline

I am designing a video pixel data processing pipeline in VHDL which involves several steps including multiply and divide.
I want to keep signals synchronised so that I can e.g. maintain a sync signal and output it correctly at the end of the pipeline along with manipulated pixel data which has been through several processing stages.
I assume I want to use shift registers or something to delay signals by the right number of cycles so that the output is correct, but I'm looking for advice about good ways to design this, particularly as the number of pipeline stages for different signals may vary as I evolve the design.
Good question.
I'm not aware of a complete solution but here are two partial strategies...
Interconnecting components... It would be really nice if a component could export a generic whose value was its pipeline depth. Unfortunately you can't, and dedicating a port to this seems silly (though it's probably workable; as it would be an integer constant, it would disappear in synthesis)
Failing that, pass IN a generic indicating the budget for this module. Inside the module, assert (severity FAILURE) if the budget can't be met... (this assert is checkable at synth time and at least Xilinx XST handles similar asserts)
Make the budget a hard number, and either assert if not equal to actual pipeline depth, or add pipe stages inside the module if the budget is too large, and only assert if the budget is too small.
That way you are connecting predictable modules, and the top level can perform pipeline arithmetic to balance things (e.g. passing a computed constant value to a programmable delay line)
Within a component... I use a single process, with registers represented as internal signals whose names reflect their pipe stage, exponent_1, exponent_2, exponent_3 and so on. Within the process, the first section describes all the actions for the first cycle, the second section describes the second cycle, and so on. Typically the "easier" paths may be copied verbatim to the next pipe stage, just to sync them with the critical path. The process is fairly organised and easy to maintain.
I might break a 32-bit multiply down into 16*16 chunks and pipeline the partial product additions. The control this gives, USED to give better results than XST gave alone...
I know some people prefer variables within a process, and I use them for intermediate results in a pipe stage, but using signals I can describe the pipeline in its natural order (thanks to postponed assignment) whereas using variables, I would have to describe it backwards!
I create a package for each of my major processing blocks, one of the constants in there is the processing delay of that block. I can then connect that up to my general-purpose "delay-line" block which has a generic for the number of cycles.
Keeping that constant in "sync" with the actual implementation is best done by a self-checking testbench.
Something to consider is delay lines (i.e. back to back registers) vs FIFOs.
Consider a module X with a pipeline delay of N. FIFOs work well when N is variable. The trick is remembering that you can only request new work when both the module and the FIFO can accept it. Ideally you size the FIFO so that it can contain the maximum number of items that X can work on concurrently, but sometimes that's not practical, for example if your calculation includes accesses to a distant memory.
Another option is integrating the side channel (i.e. the path that your sync flag is taking) into the module X rather than it going outside. If you do this then if any part of the calculation has to stall, you can also stall the side channel and the two stay in sync. You can do this because you're in a scope that has all the necessary signals in it. Then all signals, whether used in the calculation or not, appear at the output at the same time.

When timing how long a quick process runs, how many runs should be used?

Let's say I am going to run process X and see how long it takes.
I am going to save into a database a date I ran this process, and the time it took. I want to know what to put into the DB.
Process X almost always runs under 1500ms, so this is a short process. It usually runs between 500 and 1500ms, quite a range (3x difference).
My question is, how many "runs" should be saved into the DB as a single run?
Every run saved into the DB as its own row?
5 runs, averaged, then save that time?
10 runs, averaged?
20 runs, remove anything more than 2 std deviations away, and save everything inside that range?
Does anyone have any good info backing them up on this?
Save the data for every run into its own row. Then later you can use and analyze the data however you like; i.e., all the other options you listed can be performed after the fact. It's not really possible for someone else to draw meaningful conclusions about how to average/analyze the data without knowing more about what's going on.
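For example, a minimal Python/sqlite sketch of the one-row-per-run approach (the table and column names are just placeholders):

import sqlite3
import time

conn = sqlite3.connect("timings.db")
conn.execute("CREATE TABLE IF NOT EXISTS runs (ran_at TEXT, millis REAL)")

def timed_run(fn):
    start = time.perf_counter()
    fn()
    millis = (time.perf_counter() - start) * 1000.0
    # One row per run; averaging, trimming, etc. can always be done later in SQL.
    conn.execute("INSERT INTO runs VALUES (datetime('now'), ?)", (millis,))
    conn.commit()
    return millis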
The fastest run is the one that most accurately times only your code.
All slower runs are slower because of noise introduced by the operating system scheduler.
The variance you experience is going to differ from machine to machine, and even on identical machines, the set of runnable processes will introduce noise.
None of the above. Bryan is close, though. You should save every measurement, but don't average them. The average (arithmetic mean) can be very misleading in this type of analysis, because some of your measurements will be much longer than the others. This will happen because things can interfere with your process, even on 'clean' test systems. It can also happen because your process may not be as deterministic as you might think.
Some people think that simply taking more samples (running more iterations) and averaging the measurements will give them better data. It doesn't. The more you run, the more likely it is that you will encounter a perturbing event, thus making the average overly high.
A better way to do this is to run as many measurements as you can (time permitting). 100 is not a bad number, but around 30 can be enough.
Then sort these by magnitude and graph them. Note that this is not a normal distribution. Compute some simple statistics: mean, median, min, max, lower quartile, upper quartile.
Contrary to some guidance, do not 'throw away' outside values or 'outliers'. These are often the most interesting measurements. For example, you may establish a nice baseline, then look for departures. Understanding these departures will help you fully understand how your process works, how the system affects your process, and what can interfere with your process. It will often readily expose bugs.
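A small Python sketch of that analysis: sort the raw samples, keep the outliers, and report the simple statistics.

import numpy as np

def summarize(samples_ms):
    # Summarize raw timing samples; outliers are kept, not discarded.
    s = np.sort(np.asarray(samples_ms, dtype=float))
    return {
        "n":      len(s),
        "min":    s[0],
        "max":    s[-1],
        "mean":   s.mean(),
        "median": np.median(s),
        "q1":     np.percentile(s, 25),
        "q3":     np.percentile(s, 75),
    }

# e.g. summarize([512.0, 530.0, 498.0, 1490.0, 505.0])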
Depends what kind of data you want. I'd say one line per run initially, then analyze the data, go from there. Maybe store a min/max/average of X runs if you want to consolidate it.
http://en.wikipedia.org/wiki/Sample_size
Bryan is right: you need to investigate more. If your code has that much variance even "most" of the time, then you might have a lot of fluctuation in your test environment because of other processes, OS paging, or other factors. If not, it seems that you have code paths doing wildly varying amounts of work, and coming up with a single number per run to describe the performance of such a multi-modal system is not going to tell you much. So I'd say isolate your setup as much as possible, run at least 30 trials, and get a feel for what your performance curve looks like. Once you have that, you can use that Wikipedia page to come up with a number that will tell you how many trials you need to run per code change to see if the performance has increased/decreased with some level of statistical significance.
While saying, "Save every run," is nice, it might not be practical in your case. However, I do think that storing only the average eliminates too much data. I like storing the average of ten runs, but instead of storing just the average, I'd also store the max and min values, so that I can get a feel for the spread of the data in addition to its center.
The max and min information in particular will tell you how often corner cases arise. Is the 1500ms case a one-in-1000 outlier? Or is it something that recurs on a regular basis?

How to automatically tune parameters of an algorithm?

Here's the setup:
I have an algorithm that can succeed or fail.
I want it to succeed with highest probability possible.
Probability of success depends on some parameters (and some external circumstances):
struct Parameters {
    float param1;
    float param2;
    float param3;
    float param4;
    // ...
};

bool RunAlgorithm(const Parameters& parameters) {
    // ...
    // P(return true) is a function of parameters.
}
How can I (automatically) find the best parameters with the smallest number of calls to RunAlgorithm?
I would be especially happy with a ready-made library.
If you need more info on my particular case:
Probability of success is a smooth function of the parameters and has a single global optimum.
There are around 10 parameters, most of them independently tunable (but some are interdependent).
I will run the tuning overnight; I can handle around 1000 calls to RunAlgorithm.
Clarification:
The best parameters have to be found automatically overnight and used during the day.
The external circumstances change each day, so computing them once and for all is impossible.
More clarification:
RunAlgorithm is actually a game-playing algorithm. It plays a whole game (Go or Chess) against a fixed opponent. I can play 1000 games overnight, and every night there is a different opponent.
I want to see whether different opponents need different parameters.
RunAlgorithm is smooth in the sense that changing a parameter a little changes the algorithm only a little.
Probability of success could be estimated from a large number of samples with the same parameters,
but it is too costly to run so many games without changing parameters.
I could try to optimize each parameter independently (which would result in 100 runs per parameter), but I guess there are some dependencies.
The whole problem is about using the scarce data wisely.
Games played are very highly randomized, no problem with that.
Maybe you are looking for genetic algorithms.
Why not let the program fight with itself? Take some parameter vector v and let it fight against v + (0.1, 0, 0, ..., 0), say, 15 times. Then take the winner, modify another parameter, and so on. With enough luck, you'll get a strong player, able to defeat most others.
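A rough Python sketch of that idea; play_match(a, b) is a hypothetical function that plays one game and returns True if parameter vector a beats parameter vector b:

import random

def self_play_tune(v, step=0.1, games=15, rounds=50):
    n = len(v)
    for r in range(rounds):
        i = r % n                        # perturb one parameter per round
        challenger = list(v)
        challenger[i] += random.choice([-step, step])
        wins = sum(play_match(challenger, v) for _ in range(games))
        if wins > games / 2:             # keep the winner
            v = challenger
    return v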
Previous answer (much of it is irrelevant now that the question has been edited):
With these assumptions and that level of generality, you will achieve nothing (except maybe an impossibility result).
Basic question: can you change the algorithm so that it returns the probability of success, not the result of a single experiment? Then use an appropriate optimization technique (nobody will tell you which one under such general assumptions). In Haskell, you can even change the code so that it will find the probability in simple cases (a probability monad) instead of giving a single result. As others mentioned, you can use a genetic algorithm with the probability as the fitness function. If you have a formula, use a computer algebra system to find the maximum value.
Probability of success is a smooth function of the parameters and has a single global optimum.
Smooth or continuous? If smooth, you can use differential calculus (Lagrange multipliers?). You can even, with small changes to the code (assuming your programming language is general enough), compute derivatives automatically using automatic differentiation.
I will run the tuning overnight; I can handle around 1000 calls to RunAlgorithm.
That complex? Then 1000 runs will only let you check two possible values per parameter (2^10 = 1024 combinations), out of the vast range of possible floats. You won't even determine the order of magnitude, or even the order of the order of magnitude.
There are around 10 parameters, most of them independently tunable (but some are interdependent).
If you know what is independent, fix some parameters and change those that are independent of them, as in divide-and-conquer. Obviously it's much better to tune two algorithms with 5 parameters each than one with 10.
I'm downvoting the question unless you give more details. This has too much noise for an academic question and not enough data for a real-world question.
The main problem you have is that, with ten parameters, 1000 runs is next to nothing, given that, for each run, all you have is a true/false result rather than a P(success) associated with the parameters.
Here's an idea that, on the one hand, may make the best use of your 1000 runs and, on the other hand, also illustrates the intractability of your problem. Let's assume the ten parameters really are independent. Pick two values for each parameter (e.g. a "high" value and a "low" value). There are 1024 ways to select unique combinations of those values; run your method for each combination and store the result. When you're done, you'll have 512 test runs for each value of each parameter; with the independence assumption, that might give you a decent estimate of the conditional probability of success for each value. An analysis of that data should give you a little information about how to set your parameters, and may suggest refinements of your "high" and "low" values for future nights. The back of my mind is dredging up ANOVA as a possibly useful statistical tool here.
Very vague advice... but, as has been noted, it's a rather vague problem.
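Here is a rough Python sketch of that two-level design; run_algorithm(params) is a hypothetical stand-in for RunAlgorithm (returning True on success), and the "low"/"high" values are placeholders:

import itertools
from collections import defaultdict

lows  = [0.1] * 10          # placeholder "low" value per parameter
highs = [0.9] * 10          # placeholder "high" value per parameter

wins = defaultdict(int)     # (parameter index, level) -> number of successes
runs = defaultdict(int)

for combo in itertools.product((0, 1), repeat=10):      # 2**10 = 1024 runs
    params = [highs[i] if bit else lows[i] for i, bit in enumerate(combo)]
    ok = run_algorithm(params)
    for i, bit in enumerate(combo):
        runs[(i, bit)] += 1                              # 512 runs per level
        wins[(i, bit)] += ok

for i in range(10):
    p_low  = wins[(i, 0)] / runs[(i, 0)]
    p_high = wins[(i, 1)] / runs[(i, 1)]
    print(f"param {i}: P(success | low) = {p_low:.2f}, "
          f"P(success | high) = {p_high:.2f}")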
Specifically for tuning parameters for game-playing agents, you may be interested in CLOP
http://remi.coulom.free.fr/CLOP/
Not sure if I understood correctly...
If you can choose the parameters for your algorithm, does that mean you can choose them once and for all?
Then you could simply:
have the developer run all/many cases only once, find the best case, and replace the parameters with the best values;
at runtime, for your real user, the algorithm is already parameterized with the best parameters.
Or, if the best values change for each run ...
Are you looking for a Genetic Algorithms type of approach?
The answer to this question depends on:
Parameter range. Can your parameters have a small or large range of values?
Game grading. Does it have to be a boolean, or can it be a smooth function?
One approach that seems natural to this problem is Hill Climbing.
A possible way to implement this would be to start with several points and calculate their "grade"; then figure out a favorable direction for the next point and try to "ascend".
The main problems that I see in this question, as you presented it, are the huge range of parameter values and the fact that the result of a run is boolean (rather than a numeric grade). This will require many runs to figure out whether a chosen set of parameters is indeed good, and on the other hand there is a huge set of parameter values yet to check. Just checking all directions will result in a (too?) large number of runs.
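A rough Python sketch of that approach, with the boolean result turned into a noisy grade by repeating runs; run_algorithm(params) is a hypothetical stand-in for RunAlgorithm, and the step size and repeat counts are arbitrary:

import random

def grade(params, repeats=20):
    # Estimate P(success) from repeated boolean runs.
    return sum(run_algorithm(params) for _ in range(repeats)) / repeats

def hill_climb(start, step=0.1, iterations=40):
    current = list(start)
    best = grade(current)
    for _ in range(iterations):
        i = random.randrange(len(current))        # pick a direction
        candidate = list(current)
        candidate[i] += random.choice([-step, step])
        score = grade(candidate)
        if score > best:                          # ascend only if it improves
            current, best = candidate, score
    return current, best

With repeats=20 and iterations=40 this uses about 800 games, which fits within the overnight budget of roughly 1000 calls.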
