I am trying to use Metal to implement a hardware-accelerated image filter (seam carving, for anyone interested). One step involves running code row by row, with the calculations for each row depending on the calculations for the row above. However, the calculations within each row can be parallelized across pixels.
One approach would be to schedule the kernel once for each row, but I am sure there is a better way to do it, as that would cause a lot of overhead.
Is there some way to tell Metal in which order to execute thread groups?
I'm implementing a Kalman filter that fuses 3D position data (provided by two different computer vision algorithms). I am modeling the problem with a 9-dimensional state vector (position, velocity, and acceleration). However, the data from the two sensors do not arrive at the same time. Since I compute the velocity from the time step between the previous data point and the current one, two consecutive data points can be quite different yet separated by only a very small time step, making it seem as if the position has changed rapidly.
I am wondering if anyone has insight or direction on the best way to approach this problem: will the Kalman filter itself be tolerant of this behaviour? Or should I place all data received within a time window into a bin and perform the update/predict cycle less frequently on a batch of data? The resources I've seen for using a Kalman filter in object tracking all use a single camera (i.e. synchronous data), so I'm having trouble finding information related to my use case.
Any help is very much appreciated! Thank you!
From what I gathered from your question and our conversation in the comments, let me first briefly describe the issue and then suggest a solution.
A quick recap
You have a system with two independent sensors, which take measurements at different rates (30 Hz and 5 Hz), possibly with some time jitter. The good news is that each such measurement is completely sufficient to perform an update step of your Kalman filter. Each measurement has a time stamp.
Another important point is that the measurements may have poor precision, so that the change in position looks implausible.
A possible solution
Define the smallest time interval for calling your Kalman filter, so that none of the received measurements has to wait too long to be processed. A 100 Hz rate looks like a good first choice; in this case your dt would be 0.01 s.
Design your F and Q matrices based on the chosen dt (they both strongly depend on this value).
On each call without a measurement, execute only the prediction step. As soon as a measurement arrives, do an update. Your call sequence would then look like this:
call sequence:
init()
predict()
predict()
predict()
predict()
update(sensor1)
predict()
update(sensor2)
update(sensor1)
predict()
predict()
update(sensor1)
predict()
and so on...
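To make this concrete, here is a minimal sketch of such a loop in Python with NumPy. The constant-acceleration transition matrix, the position-only measurement matrix, and all the noise values here are illustrative assumptions, not your actual model:

import numpy as np

dt = 0.01  # 100 Hz filter tick, as suggested above

# State: [x, y, z, vx, vy, vz, ax, ay, az]
F = np.eye(9)
for i in range(3):
    F[i, i + 3] = dt            # position += velocity * dt
    F[i, i + 6] = 0.5 * dt**2   # position += 0.5 * acceleration * dt^2
    F[i + 3, i + 6] = dt        # velocity += acceleration * dt

H = np.hstack([np.eye(3), np.zeros((3, 6))])  # both sensors measure position only
Q = 1e-4 * np.eye(9)   # process noise, a tuning placeholder (see below)
P = np.eye(9)          # initial state covariance
x = np.zeros((9, 1))   # initial state

def predict():
    global x, P
    x = F @ x
    P = F @ P @ F.T + Q

def update(z, R):
    # z: 3x1 measured position, R: 3x3 noise covariance of that sensor
    global x, P
    y = z - H @ x                    # innovation
    S = H @ P @ H.T + R              # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)   # Kalman gain
    x = x + K @ y
    P = (np.eye(9) - K @ H) @ P

On every 100 Hz tick you would call predict() once, and additionally update(z, R_sensor1) or update(z, R_sensor2) for whatever measurement arrived in that interval.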
To deal with the precision issue you could use a reference signal (the ground truth). Analyze the error of each sensor reading for each signal (x, y, z) compared to the reference. A Kalman filter can work well ONLY with readings whose error is normally distributed with zero mean. If you see a systematic offset, maybe you can get rid of it. From the observed error you can calculate the standard deviation (and the variance), so you can tell your filter how good the measurements are. This will be your R matrix.
If you don't have a reference, you can take some measurements while standing still in the same place. Your reference position would then be constant, and you could look at the dispersion of the readings.
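For instance, a rough way to estimate R from such a stationary recording (the synthetic data below just stands in for your logged readings):

import numpy as np

# (N, 3) array of x, y, z positions recorded while standing still;
# replace this synthetic stand-in with your actual sensor log.
rng = np.random.default_rng(0)
readings = rng.normal([1.0, 2.0, 0.5], [0.02, 0.02, 0.05], size=(500, 3))

bias = readings.mean(axis=0)                 # systematic offset, if any
residuals = readings - bias                  # true position assumed constant
R_sensor1 = np.cov(residuals, rowvar=False)  # 3x3 measurement noise covariance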
Tune the elements of your Q matrix to describe the possible dynamics of your state elements. A smaller Q element for position tells the filter not to change it too fast, so the (possibly) poor precision of your sensors will be partially smoothed out (think of a low-pass filter as intuition).
I hope this helps. Please correct me if I have misunderstood something.
It would be helpful to see a plot of your sensor readings (and, if possible, of the reference trajectory).
So I have a very simple NN script written in TensorFlow, and I am having a hard time trying to track down where some "randomness" is coming from.
I have recorded the
Weights,
Gradients,
Logits
of my network as I train, and for the first iteration it is clear that everything starts off the same. I have a SEED value both for how data is read in and a SEED value for initializing the weights of the net; those I never change.
My problem is that on, say, the second iteration of every re-run I do, I start to see the gradients diverge (by a small amount, like 1e-6 or so). Over time, this of course leads to non-repeatable behaviour.
What might the cause of this be? I don't know where any possible source of randomness might be coming from...
Thanks
There's a good chance you could get deterministic results if you run your network on CPU (export CUDA_VISIBLE_DEVICES=), with a single thread in the Eigen thread pool (tf.Session(config=tf.ConfigProto(intra_op_parallelism_threads=1))), one Python thread (no multi-threaded queue runners, which you get from ops like tf.batch), and a single well-defined operation order. Setting inter_op_parallelism_threads=1 may also help in some scenarios.
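Putting those settings together, a minimal sketch (TensorFlow 1.x-style API, as used in the question):

import os
import tensorflow as tf

os.environ["CUDA_VISIBLE_DEVICES"] = ""  # hide the GPU, force CPU execution

config = tf.ConfigProto(
    intra_op_parallelism_threads=1,  # single thread in the Eigen pool
    inter_op_parallelism_threads=1,  # also serialize independent ops
)
sess = tf.Session(config=config)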
One issue is that floating point addition/multiplication is non-associative, so one fool-proof way to get deterministic results is to use integer arithmetic or quantized values.
Barring that, you could isolate which operation is non-deterministic and try to avoid using that op. For instance, there's the tf.add_n op, which doesn't say anything about the order in which it sums the values, and different orders produce different results.
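The underlying non-associativity is easy to see even in plain Python, without TensorFlow:

a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)  # 1.0
print(a + (b + c))  # 0.0 -- the 1.0 is absorbed by the huge intermediate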
Getting deterministic results is a bit of an uphill battle because determinism is in conflict with performance, and performance is usually the goal that gets more attention. An alternative to trying to get exactly the same numbers on reruns is to focus on numerical stability: if your algorithm is stable, then you will get reproducible results (i.e. the same number of misclassifications) even though the exact parameter values may be slightly different.
The TensorFlow reduce_sum op is specifically known to be non-deterministic. Furthermore, reduce_sum is used for calculating bias gradients.
This post discusses a workaround to avoid using reduce_sum (i.e. taking the dot product of any vector with a vector of all 1's is the same as reduce_sum).
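A minimal sketch of that trick (TensorFlow 1.x API; the shapes are just for illustration):

import tensorflow as tf

v = tf.constant([[1.0, 2.0, 3.0]])  # shape (1, 3)
ones = tf.ones((3, 1))              # shape (3, 1)

total_matmul = tf.matmul(v, ones)   # deterministic: a fixed-order dot product
total_reduce = tf.reduce_sum(v)     # may be non-deterministic on GPU

with tf.Session() as sess:
    print(sess.run([total_matmul, total_reduce]))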
I have faced the same problem.
The working solution for me was to:
1- Use tf.set_random_seed(1) so that all tf functions have the same seed on every new run.
2- Train the model on the CPU, not the GPU, to avoid non-deterministic GPU operations due to precision.
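In code that amounts to something like this (TensorFlow 1.x API; pinning ops to /cpu:0 is one way to keep them off the GPU):

import tensorflow as tf

tf.set_random_seed(1)  # graph-level seed, identical on every run

with tf.device('/cpu:0'):  # keep the ops on the CPU
    w = tf.Variable(tf.random_normal([10, 10], seed=1))
    loss = tf.reduce_sum(w * w)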
I have a large dimension, and it is taking longer and longer to process. I would like to decrease the processing time as much as possible.
There are literally hundreds of articles on how to process SSAS objects as efficiently and quickly as possible.
There are lots of tips and tricks one can apply to speed up dimension and cube processing. I managed to apply all, or at least a big majority, of them and I am still not happy with the result.
I have a large dimension built on top of a table.
It has around 60 million records and it keeps growing fast.
Rows are either added to it or deleted from it; updates never happen.
I am looking for a solution that will allow me to perform incremental processing of my dimension.
I know that the data from previous months will not change. I would like to do something similar to partitioning my cube, but on the dimension.
I am using SQL Server 2012, and to my knowledge dimension partitioning is not supported.
I am currently using ProcessUpdate on my dimension. I tried processing both by attribute and by table, but they render almost the same result. I have hierarchies and relationships, some set to rigid; I am only using those attributes that are truly needed, and so on.
ProcessUpdate has to read all the records in a dimension, even those that I know have not changed. Is there a way to partition a dimension? If I could tell SSAS to only process the last 3-4 weeks of data in my dimension and not touch the rest, it would greatly speed up my processing time.
I would appreciate your help
OK, so I did a bit of research and I can confirm that incremental dimension processing is not supported.
It is possible to do ProcessAdd on a dimension, but if you have records that were deleted or updated you cannot use it.
It would be a useful thing to have, but MS hasn't developed it, and I don't think it will.
Incremental processing of any table is, however, possible in tabular cubes,
so if you have a similar requirement and your cube is not too complex, then creating a tabular cube is the way to go.
This is the first time I've asked a question here, so thank you very much in advance, and please forgive my ignorance. I've also just started CUDA programming.
Basically, I have a bunch of points, and I want to calculate all the pair-wise distances. Currently my kernel function holds one point and iteratively reads in all the other points (from global memory) to do the calculation. Here are some of my confusions:
I'm using a Tesla M2050 with 448 cores. But my current parallel version (kernel<<<128,16,16>>>) achieves much higher parallelism (about 600x faster than kernel<<<1,1,1>>>). Is this due to multithreading or pipelining, or do the two actually indicate the same thing?
I want to further improve the performance, so I figured I would use shared memory to hold some of the input points for each block. But the new code is just as fast. What's the possible cause? Could it be that I set too many threads?
Or is it because I have an if-statement in the code? I only consider and count the short distances, so I have a statement like if (dist < 200). How much should I worry about this one?
A million thanks!
Bin
Mark Harris has a very good presentation about optimizing CUDA: Optimizing Parallel Reduction in CUDA.
Algorithmic optimizations
Changes to addressing, algorithm cascading
11.84x speedup, combined!
Code optimizations
Loop unrolling
2.54x speedup, combined
Having an extra conditional statement does indeed cause problems, although it will be the last thing you want to optimize, if only because you need to know the layout of your code before building in such assumptions!
The problem you are working on sounds like the famous n-body problem;
see Fast N-Body Simulation with CUDA.
An additional performance increase can be achieved if you can avoid doing a pairwise computation for elements that are too far apart to have an effect on each other. This applies to any relationship that can be expressed geometrically, whether it be pairwise costs or a physics simulation with springs. My favorite method is to divide the grid into boxes; each element puts itself into a box via integer division of its coordinates, and you then only evaluate pairwise relations between neighboring boxes. This reduces the work to roughly O(n*m), where m is the number of candidates per element.
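A rough CPU-side sketch of that boxing idea in Python (the box size, cutoff, and synthetic data are illustrative; a CUDA version would follow the same structure, e.g. one block per box):

from collections import defaultdict
import numpy as np

CUTOFF = 200.0  # only pairs closer than this matter, as in the question
BOX = CUTOFF    # box edge >= cutoff, so neighboring boxes cover all close pairs

points = np.random.rand(10000, 3) * 5000.0  # synthetic stand-in data

# Each point drops itself into a box via integer division of its coordinates.
boxes = defaultdict(list)
for idx, p in enumerate(points):
    boxes[tuple((p // BOX).astype(int))].append(idx)

close_pairs = []
for (bx, by, bz), members in boxes.items():
    # Look only at this box and its 26 neighbors instead of all n points.
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dz in (-1, 0, 1):
                for j in boxes.get((bx + dx, by + dy, bz + dz), ()):
                    for i in members:
                        if i < j and np.linalg.norm(points[i] - points[j]) < CUTOFF:
                            close_pairs.append((i, j))  # each pair counted once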
(1) The GPU runs many more threads in parallel than there are cores. This is because each core is pipelined: operations take around 20 cycles on compute capability 2.0 (Fermi) architectures. So in each clock cycle a core starts work on a new operation, returns the finished result of one operation, and moves all the other (around 18) operations one step closer to completion. So, to saturate the GPU, you might need something like 448 * 20 = 8960 threads.
(2) It's probably because your values are getting cached in the L1 and L2 caches.
(3) It depends on how much work you're doing inside the if conditional. The GPU must run all 32 threads in a warp through all the code inside the if, even if the condition is true for only one of those threads. If there is a lot of code in the conditional compared to the rest of your kernel, and relatively few threads go through that code path, you will likely end up with low compute throughput.
I'm using the Qt graphics API to display layers in some GIS software.
Each layer is a group containing graphic primitives. I have a performance issue when loading fairly large data sets; for example, this is what happens when making a group composed of ~96k circular paths (points from a shapefile):
callgrind image http://f.imagehost.org/0750/profile-createItemGroup.png
The complete callgrind dump is here.
The QGraphicsScene::createItemGroup() call takes about 150 seconds to complete on my 2.4GHz Core 2, and it seems all this time is spent in QGraphicsItemPrivate::updateEffectiveOpacity(), which itself spends 37% of its time calling QGraphicsItem::flags() 4 billion times (the data comes from a test script with no GUI, just a scene, not even tied to a view).
All the rest is pretty much instantaneous (creating the items, reading the file, etc.). I tried disabling the scene's index before creating the group and obtained similar results.
What can I do to improve performance in this case? If I can't, is there a faster way to create groups?
After studying the source code a little, I found out that updateEffectiveOpacity is O(n²) with respect to the number of children of the item's parent item (search for the method qt_allChildrenCombineOpacity). This is probably also the reason the method disappeared in Qt 4.6 and was apparently replaced by something else. Anyway, you should try setting the ItemDoesntPropagateOpacityToChildren flag on the group item (i.e. you'll have to create the group yourself), at least while adding all the items.
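A minimal sketch of that idea (shown here in PyQt purely for brevity; the C++ calls have the same names, and the item count is just a stand-in):

from PyQt5.QtWidgets import (QApplication, QGraphicsEllipseItem,
                             QGraphicsItem, QGraphicsItemGroup, QGraphicsScene)

app = QApplication([])
scene = QGraphicsScene()

# Create the group yourself instead of using scene.createItemGroup(),
# so the flag is already set before any children are added.
group = QGraphicsItemGroup()
group.setFlag(QGraphicsItem.ItemDoesntPropagateOpacityToChildren, True)
scene.addItem(group)

for i in range(1000):  # stand-in for the ~96k circular paths
    item = QGraphicsEllipseItem(float(i % 100), float(i // 100), 2.0, 2.0)
    group.addToGroup(item)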