I have been using Beckhoff PLCs (Structured Text language) for a while, and I have been avoiding for loops in all my algorithms to make sure everything runs in real time and never gets stuck in a loop. Timing is critical for the algorithms I am working on. Every line of the code finishes within the 1 ms cycle, so I assumed that if I used a for loop somewhere, each iteration would take 1 ms and the rest of the software would wait until the loop finished.
For this reason I have not written any loops, but one of my colleagues implemented his software with a for loop. The loop runs from 1 to 100000, and every 1 ms his counter increases by 100000 while the other counters, outside the loop, increase once per 1 ms.
The procedure as I understand it is: if the PLC has time left after handling everything within the 1 ms cycle, it takes care of the rest, which might be diagnostics, I/O checks, or whatever.
So how can it handle 100000 iterations when the cycle time I configured is 1 ms? What if I used a million different loops in the program? Maybe it can handle the for loop only because the program is so small; would it behave the same once the software is huge? How can I tell whether it will finish all the loops without overrunning the cycle time?
I will try it and see, but to gain some insight I would like to discuss it with you.
There is no universal answer to the question of whether processing can remain within your expected/required/preferred scan cycle time.
Computing power is bounded. Each iteration of a loop takes non-zero computing power so, of course, there can be more loops (and/or iterations of loops) to process than what your PLC can handle. There is no magic in your PLC that makes looping "free". On the other hand, looping is, generally speaking, very fast (compared to the processing that is performed inside the loop).
Loops are inherently sub-optimal only if they are wasteful. A loop that performs N iterations is not computationally worse than writing the same code out N times without a loop.
Signs that your loop may be wasteful:
It is used only to wait for something or poll some data
It repeatedly does the same thing over the same data
It does not iterate over a matching data structure (such as an array)
In a PLC, you need to make sure that your loops perform processing that reliably fits within your cycle time. The worst case is what you care about: if your loop takes 0.01ms 99.9% of the time, but the remaining 0.1% requires 1 second, it will not work.
If a loop is what best fits an algorithm (in terms of algorithmic efficiency, clarity, maintainability), it is likely to be the fastest (or very close to the fastest) way to express your solution. You should only be afraid of loops if you code wasteful loops.
In the end, if your PLC is not powerful enough to do everything that needs to be done even with a well-coded solution, the presence or absence of loops is not the problem or the solution. But assume your PLC is fast. Very fast. I run programs with hundreds of thousands of variables, many loops executing every cycle, and am getting cycle times below 0.1ms on a Raspberry Pi. I never avoid loops for anything that is "loopy" by nature, but my loops never block waiting for something, they never allocate memory or perform processing that will introduce meaningful variation in execution time, and they do "simple" things (where "simple" is read with the understanding that computers are truly very fast nowadays).
The real question is not whether loops are good or bad, but whether you are able to identify when to use them, and know how to code them well. It is not about loops being good or bad, but rather you being good or bad with them.
Also, test. See how many iterations of an empty loop you can perform before your cycle time suffers. Try it with non-empty loops that perform things you care about. You will learn where your true bottleneck is quickly, and you are likely to stop worrying about loops and focus on the code inside them.
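As a rough illustration of that kind of experiment (this is plain C++ rather than Structured Text, and the iteration count and loop body are invented for the example), you can time a bounded loop and see how the per-iteration cost compares to a 1 ms budget:

#include <chrono>
#include <cstdio>

int main() {
    // Hypothetical work: sum a bounded number of integers, similar in spirit
    // to a FOR loop that scans an array once per cycle.
    const long iterations = 100000;
    volatile long sum = 0;          // volatile so the loop is not optimized away

    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i)
        sum = sum + i;
    auto stop = std::chrono::steady_clock::now();

    double us = std::chrono::duration<double, std::micro>(stop - start).count();
    std::printf("%ld iterations took %.1f us (%.4f us per iteration)\n",
                iterations, us, us / iterations);
    return 0;
}

On a desktop-class CPU this reports a small fraction of a microsecond per iteration; your PLC's CPU will be slower, but the shape of the experiment (and of the same test written in Structured Text inside a task) is the same.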
Related
I need to iterate over nearly 200 objects. Using a for loop may increase the response time; is there any other way to iterate over the objects in less time?
If
the time required to process each of those 200 items is already optimal, and
caching results from previous executions of this loop is not an option,
then you might get a speedup from distributing the work to multiple threads, each iterating over a segment of the list.
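A minimal sketch of that kind of partitioning, assuming the objects live in a std::vector and the per-object work (the process() function here is a made-up placeholder) is independent:

#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

struct Item { /* whatever the 200 objects hold */ };

void process(Item &item) { /* placeholder for the real per-object work */ }

void processAll(std::vector<Item> &items, unsigned numThreads) {
    std::vector<std::thread> workers;
    std::size_t chunk = (items.size() + numThreads - 1) / numThreads;

    for (unsigned t = 0; t < numThreads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = std::min(items.size(), begin + chunk);
        if (begin >= end) break;
        // Each thread iterates over its own segment of the list.
        workers.emplace_back([&items, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                process(items[i]);
        });
    }
    for (auto &w : workers)
        w.join();
}

Whether this helps at all depends on how heavy process() really is; for 200 cheap objects the cost of starting the threads can easily exceed the savings.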
To give you more help, you probably need to state your problem in more detail.
Don't guess about performance, measure it using a decent profiler.
I bet that the profiler will show the problem to be something you do with the 200 objects in the loop body, and not the loop iteration instructions.
The bare for loop itself, using the typical integer loop variable, should take at most a few microseconds for your 200 iterations. So, unless you're doing really high-rate real-time calculations, this doesn't matter.
For example, are there any practical[1] performance differences between the following in any language:
for i=1 to 10:
print i
for i=1 to 10:
print i
for i=1 to 10:
print i
for i=1 to 10:
print i
for i=1 to 10:
print i
versus
for i=1 to (10 * 5):
print ((i - 1) % 10) + 1
Obviously the task would typically be less trivial, but the point remains. If you have to iterate over a data set is there any advantage to doing all operations on that data in one pass versus repeatedly looping through the set?
[1]: I understand that there might be costs associated with repeatedly reallocating space. However, if the time is insignificant compared to any real-life task then let's disregard it for the moment.
The short answer is it depends:
Depending on the actual task performed, readability might be improved by one approach over the other. This is a practical issue affecting correctness and maintainability, and it should be your main concern.
Breaking down a large loop in smaller ones may increase cache efficiency. But cache sizes are rather large nowadays.
Breaking down the large loop into smaller ones may produce simpler expressions, as is obvious in your example, or fewer tests. You might see an improvement in the multiple-loop case, but it will be so small that it should not be a compelling reason.
Combining small loops into a larger one may yield fewer comparisons and jumps, as is the case in your example, for a tiny improvement. But for your example, unrolling the small loops completely might be even more advantageous.
As always, for performance tuning, you have to perform benchmarks and compare timings for actual data. Unless you see a big improvement, choose the simplest, most readable and maintainable solution. Note that optimality is a temporary situation, any changes in the environment, technology, data quantity and characteristics may impact the performance of any solution.
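If you want to benchmark exactly the split-versus-fused question, here is a minimal sketch (the data and the trivial loop bodies are invented, so treat the timings as illustrative only):

#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> data(1000000, 1);
    volatile long sink = 0;          // keeps the compiler from discarding the work

    // Version A: several passes over the data, one simple operation per pass.
    auto t0 = std::chrono::steady_clock::now();
    long a = 0;
    for (int x : data) a += x;
    for (int x : data) a += x * 2;
    for (int x : data) a += x * 3;
    auto t1 = std::chrono::steady_clock::now();
    sink = a;

    // Version B: a single pass doing all three operations.
    long b = 0;
    for (int x : data) b += x + x * 2 + x * 3;
    auto t2 = std::chrono::steady_clock::now();
    sink = b;

    std::printf("three passes: %.2f ms, one pass: %.2f ms\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count(),
                std::chrono::duration<double, std::milli>(t2 - t1).count());
    return 0;
}

As the answer says, run it on your real data, compiler and flags before drawing any conclusion; the difference is usually dominated by memory traffic and by what the loop body does, not by the loop bookkeeping.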
This is the first time I've asked a question here, so thanks very much in advance and please forgive my ignorance. Also, I've just started CUDA programming.
Basically, I have a bunch of points, and I want to calculate all the pairwise distances. Currently my kernel function just holds one point, iteratively reads in all the other points (from global memory), and performs the calculation. Here are some of my confusions:
I'm using a Tesla M2050 with 448 cores. But my current parallel version (kernel<<<128,16,16>>>) achieves a much higher parallelism (about 600x faster than kernel<<<1,1,1>>>). Is it possibly due to the multithreading thing or a pipeline issue, or do they actually indicate the same thing?
I want to further improve the performance, so I figured I would use shared memory to hold some input points for each block. But the new code is just as fast. What's the possible cause? Could it be related to the fact that I set too many threads?
Or, is it because I have an if-statement in the code? The thing is, I only consider and count the short distances, so I have a statement like (if dist < 200). How much should I worry about this one?
A million thanks!
Bin
Mark Harris has a very good presentation about optimizing CUDA: Optimizing Parallel Reduction in CUDA.
Algorithmic optimizations (changes to addressing, algorithm cascading): 11.84x speedup, combined!
Code optimizations (loop unrolling): 2.54x speedup, combined.
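For context, "loop unrolling" simply means replacing several iterations of a loop with equivalent straight-line code, so there are fewer counter updates and branches. A generic illustration (not code from the presentation):

// Rolled: one add, one compare and one branch per element.
float sumRolled(const float *a, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += a[i];
    return s;
}

// Unrolled by 4: the same work with a quarter of the loop overhead
// (assumes, for brevity, that n is a multiple of 4).
float sumUnrolled(const float *a, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i += 4)
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    return s;
}

In CUDA kernels the same idea shows up as #pragma unroll or, as in that presentation, a templated block size that lets the compiler unroll the reduction completely.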
Having an extra conditional statement does indeed cause problems, although it will be the last thing you want to optimize, if only because you need to know the layout of your code before committing to assumptions about its size!
The problem you are working on sounds like the famous n-body problem,
see Fast N-Body Simulation with CUDA.
An additional performance increase can be achieved if you can avoid doing the pairwise computation at all, for example when elements are too far apart to have an effect on each other. This applies to any relationship that can be expressed geometrically, whether it be pairwise costs or a physics simulation with springs. My favorite method is to divide the grid into boxes; each element puts itself into a box via division, and then you only evaluate pairwise relations between neighboring boxes. This can be called O(n*m).
(1) The GPU runs many more threads in parallel than there are cores. This is because each core is pipelined. Operations take around 20 cycles on compute capability 2.0 (Fermi) architectures. So for each clock cycle, the core starts work on a new operation, returns the finished result of one operation, and moves all the other (around 18) operations one more step towards completion. So, to saturate the GPU, you might need something like 448 * 20 threads.
(2) It's probably because your values are getting cached in the L1 and L2 caches.
(3) It depends on how much work you're doing inside the if conditional. The GPU must run all 32 threads in a warp through all the code inside the if even if the condition is true for only a single one of those threads. If there is a lot of code in the conditional compared to the rest of your kernel, and relatively few threads go through that code path, it is likely that you will end up with low compute throughput.
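To make points (2) and (3) concrete, here is a minimal sketch of a tiled pairwise-distance kernel. The names (x, y, count, TILE), the assumption that the points are 2-D, and the launch configuration are all invented for illustration; it counts pairs closer than 200, like the (if dist < 200) test, but on the squared distance so no sqrt is needed:

#define TILE 128
#define CUTOFF2 (200.0f * 200.0f)

// Launch with one thread per point and blockDim.x == TILE, e.g.
//   countShortDistances<<<(n + TILE - 1) / TILE, TILE>>>(d_x, d_y, n, d_count);
__global__ void countShortDistances(const float *x, const float *y,
                                    int n, unsigned int *count)
{
    __shared__ float sx[TILE];
    __shared__ float sy[TILE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;   // the point this thread owns
    float xi = (i < n) ? x[i] : 0.0f;
    float yi = (i < n) ? y[i] : 0.0f;
    unsigned int local = 0;

    for (int tileStart = 0; tileStart < n; tileStart += TILE) {
        // Cooperatively stage one tile of points in shared memory.
        int j = tileStart + threadIdx.x;
        sx[threadIdx.x] = (j < n) ? x[j] : 0.0f;
        sy[threadIdx.x] = (j < n) ? y[j] : 0.0f;
        __syncthreads();

        // Compare "my" point against every point in the staged tile.
        if (i < n) {
            int limit = min(TILE, n - tileStart);
            for (int k = 0; k < limit; ++k) {
                int other = tileStart + k;
                if (other == i) continue;            // skip the self-pair
                float dx = xi - sx[k];
                float dy = yi - sy[k];
                if (dx * dx + dy * dy < CUTOFF2)
                    ++local;
            }
        }
        __syncthreads();   // don't overwrite the tile while others still read it
    }

    if (i < n)
        atomicAdd(count, local);   // note: each unordered pair gets counted twice
}

If the untiled version was already being served mostly out of the L1/L2 caches, as suggested in (2), staging the tile in shared memory will not change the timing much. The if here is also cheap: every thread in the warp does the same short comparison, so divergence costs almost nothing.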
Let's say I am going to run process X and see how long it takes.
I am going to save into a database the date I ran this process and the time it took. I want to know what to put into the DB.
Process X almost always runs under 1500ms, so this is a short process. It usually runs between 500 and 1500ms, quite a range (3x difference).
My question is, how many "runs" should be saved into the DB as a single run?
Every run saved into the DB as its own row?
5 runs, averaged, then save that time?
10 runs averaged?
20 runs, remove anything more than 2 std deviations away, and save everything inside that range?
Does anyone have any good info backing them up on this?
Save the data for every run into its own row. Then later you can use and analyze the data however you like; i.e., all the other options you listed can be performed after the fact. It's not really possible for someone else to draw meaningful conclusions about how to average/analyze the data without knowing more about what's going on.
The fastest run is the one that most accurately times only your code.
All slower runs are slower because of noise introduced by the operating system scheduler.
The variance you experience is going to differ from machine to machine, and even on identical machines, the set of runnable processes will introduce noise.
None of the above. Bran is close, though. You should save every measurement. But don't average them. The average (arithmetic mean) can be very misleading in this type of analysis. The reason is that some of your measurements will be much longer than the others. This will happen because things can interfere with your process, even on 'clean' test systems. It can also happen because your process may not be as deterministic as you might think.
Some people think that simply taking more samples (running more iterations) and averaging the measurements will give them better data. It doesn't. The more you run, the more likely it is that you will encounter a perturbing event, thus making the average overly high.
A better way to do this is to run as many measurements as you can (time permitting). 100 is not a bad number, but 30-ish can be enough.
Then sort these by magnitude and graph them. Note that this is not a normal distribution. Compute some simple statistics: mean, median, min, max, lower quartile, upper quartile.
Contrary to some guidance, do not 'throw away' outside values or 'outliers'. These are often the most interesting measurements. For example, you may establish a nice baseline, then look for departures. Understanding these departures will help you fully understand how your process works, how the system affects your process, and what can interfere with your process. It will often readily expose bugs.
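A small sketch of those statistics over a set of timings (the millisecond values here are invented; substitute your own measurements):

#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <vector>

// Value at a given fraction of the sorted data (0.25 = lower quartile,
// 0.5 = median, 0.75 = upper quartile), with linear interpolation.
double percentile(const std::vector<double> &sorted, double p) {
    double idx = p * (sorted.size() - 1);
    std::size_t lo = (std::size_t)idx;
    if (lo + 1 >= sorted.size()) return sorted.back();
    double frac = idx - lo;
    return sorted[lo] * (1.0 - frac) + sorted[lo + 1] * frac;
}

int main() {
    std::vector<double> ms = {512, 530, 498, 1480, 611, 540, 505, 523, 980, 517};

    std::sort(ms.begin(), ms.end());
    double mean = std::accumulate(ms.begin(), ms.end(), 0.0) / ms.size();

    std::printf("min %.0f  lower quartile %.0f  median %.0f  upper quartile %.0f  "
                "max %.0f  mean %.0f\n",
                ms.front(), percentile(ms, 0.25), percentile(ms, 0.5),
                percentile(ms, 0.75), ms.back(), mean);
    return 0;
}

Notice how the two slow runs drag the mean well above the median; that is exactly the effect described above.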
Depends what kind of data you want. I'd say one line per run initially, then analyze the data, go from there. Maybe store a min/max/average of X runs if you want to consolidate it.
http://en.wikipedia.org/wiki/Sample_size
Bryan is right, you need to investigate more. If your code has that much variance even "most" of the time, then you might have a lot of fluctuation in your test environment because of other processes, OS paging or other factors. If not, it seems that you have code paths doing wildly varying amounts of work, and coming up with a single number per run to describe the performance of such a multi-modal system is not going to tell you much. So I'd say isolate your setup as much as possible, run at least 30 trials and get a feel for what your performance curve looks like. Once you have that, you can use that wikipedia page to come up with a number that will tell you how many trials you need to run per code change to see if the performance has increased/decreased with some level of statistical significance.
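As a purely illustrative example of what that page gives you: to estimate the mean run time to within a margin of error E at roughly 95% confidence (z of about 1.96), given an estimate sigma of the standard deviation of your timings, the usual rule is n = (z * sigma / E)^2. The numbers below are invented:

#include <cmath>
#include <cstdio>

int main() {
    double z = 1.96;        // roughly 95% confidence
    double sigma = 250.0;   // estimated std deviation of the timings, in ms (invented)
    double margin = 50.0;   // how tightly we want to pin down the mean, in ms

    double n = std::pow(z * sigma / margin, 2.0);
    std::printf("run at least %.0f trials\n", std::ceil(n));   // ~97 for these numbers
    return 0;
}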
While saying, "Save every run," is nice, it might not be practical in your case. However, I do think that storing only the average eliminates too much data. I like storing the average of ten runs, but instead of storing just the average, I'd also store the max and min values, so that I can get a feel for the spread of the data in addition to its center.
The max and min information in particular will tell you how often corner cases arise. Is the 1500ms case a one-in-1000 outlier? Or is it something that recurs on a regular basis?
I have learned that a program is measured by its complexity, by which I mean Big O notation.
Why don't we measure it by its absolute running time?
Thanks :)
You use the complexity of an algorithm instead of absolute running times to reason about algorithms, because the absolute running time of a program does not only depend on the algorithm used and the size of the input. It also depends on the machine it's running on, various implementation details and what other programs are currently using system resources. Even if you run the same application twice with the same input on the same machine, you won't get exactly the same time.
Consequently when given a program you can't just make a statement like "this program will take 20*n seconds when run with an input of size n" because the program's running time depends on a lot more factors than the input size. You can however make a statement like "this program's running time is in O(n)", so that's a lot more useful.
Absolute running time is not an indicator of how the algorithm grows with different input sets. It's possible for an O(n*log(n)) algorithm to be far slower than an O(n^2) algorithm for all practical datasets.
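A quick way to see that: suppose the O(n*log(n)) algorithm hides a large constant factor, say 1000 * n * log2(n) basic operations, while the O(n^2) one costs just n * n. The constants below are invented purely to show where the asymptotic winner actually starts winning:

#include <cmath>
#include <cstdio>

int main() {
    // Invented cost models: big constant on the n*log(n) algorithm,
    // constant of 1 on the n^2 algorithm.
    const long sizes[] = {100, 1000, 10000, 100000, 1000000};
    for (long n : sizes) {
        double nlogn = 1000.0 * n * std::log2((double)n);
        double nsq   = (double)n * n;
        std::printf("n = %7ld   1000*n*log2(n) = %14.0f   n^2 = %14.0f\n",
                    n, nlogn, nsq);
    }
    return 0;
}

With these constants, the "slower" O(n^2) algorithm does less work until n grows past roughly 13,000, which is exactly the situation the answer describes.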
Running time does not measure complexity, it only measures performance, or the time required to perform the task. An MP3 player will run for the length of the time required to play the song. The elapsed CPU time may be more useful in this case.
One measure of complexity is how it scales to larger inputs. This is useful for planning the required hardware. All things being equal, something that scales relatively linearly is preferable to something that scales poorly. Things are rarely equal.
The other measure of complexity is a measure of how simple the code is. The code complexity is usually higher for programs with relatively linear performance complexity. Complex code can be costly to maintain, and changes are more likely to introduce errors.
All three (or four) measures are useful, and none of them are highly useful by themselves. The three together can be quite useful.
The question could use a little more context.
When writing a real program, we are likely to measure the program's running time. There are multiple potential issues with this, though:
1. What hardware is the program running on? Comparing two programs running on different hardware really doesn't give a meaningful comparison.
2. What other software is running? If anything else is running, it's going to steal CPU cycles (or whatever other resource your program depends on).
3. What is the input? As already said, for a small set a solution might look very fast, but scalability goes out the window. Also, some inputs are easier than others. If you hand me a dictionary and ask me to sort it, I'll hand it right back and say "done". Giving me a set of 50 cards (much smaller than a dictionary) in random order will take me a lot longer to sort.
4. What are the starting conditions? If your program runs for the first time, chances are that loading it from the hard disk will take up the largest chunk of time on modern systems. Comparing two implementations with small inputs will likely have their differences masked by this.
Big O notation covers a lot of these issues.
1. Hardware doesn't matter, as everything is normalized to the cost of a single O(1) operation.
2. Big O talks about the algorithm free of other algorithms around it.
3. Big O talks about how the input will change the running time, not how long one input takes. It tells you how badly the algorithm can perform in the worst case, not how it performs on an average or easy input.
4. Again, Big O handles algorithms, not programs running in a physical system.