I need to iterate nearly 200 objects using for loop may increase the response time is there any other way to iterate objects with less time
If
the time required to process each of those 200 items is already optimal and
caching results from previous executions of this loop is not an option
you might get a speedup from distributing the work to multiple threads, each iterating over a segment of the list.
To give you more help, you probably need to state your problem in more detail.
Don't guess about performance, measure it using a decent profiler.
I bet that the profiler will show the problem to be something you do with the 200 objects in the loop body, and not the loop iteration instructions.
The bare for loop itself, using the typical integer loop variable, should take at most a few microseconds for your 200 iterations. So, unless you're doing really high-rate real-time calculations, this doesn't matter.
Related
I am using Beckhoff PLC(Structured Text language) for a while and trying not to use for loops in all algorithm to make sure everything works in real time and won't get stuck in any loop. Time is so important for algorithms that I am working on. Any line in the code finishes in 1 ms. If I use for loop in somewhere I was thinking that when it comes into for loop, it will take 1ms for every iteration so the rest software will stay untill it finishes the loop.
For this reason I haven't writing any loop but one of my colleagues implemented his softwares with for loop. Loop starts from 1 to 100000 and every 1ms, counter increases 100000 and other counters increasing 1ms which are out of the loop.
The procedure that I know is, If PLC has time after handle everything below 1ms, it cares the rest which might be diagnosis check or IO check or whatever.
So how it can handle 100000 iteration while I identify cycle time is 1ms. What if I use 1 million different loop in program. It can handle for loop also because the program is so small. Would it behave same while software is huge? How I can understand it can finish all loops without causing overflow in cycle time?
I will try and see but for gaining insights I would like to discuss with you.
There is no universal answer to the question of whether processing can remain within your expected/required/preferred scan cycle time.
Computing power is bounded. Each iteration of a loop takes non-zero computing power so, of course, there can be more loops (and/or iterations of loops) to process than what your PLC can handle. There is no magic in your PLC that makes looping "free". On the other hand, looping is, generally speaking, very fast (compared to the processing that is performed inside the loop).
Loops are inherently sub-optimal only if they are wasteful. Replacing a loop that performs N iterations is not computationally worse than coding N times the same code without a loop.
Signs that your loop may be wasteful :
It is used only to wait for something or poll some data
It repeatedly does the same thing over the same data
It uses no matching data structure (such as an array) that you iterate over
In a PLC, you need to make sure that your loops perform processing that reliably fits within your cycle time. The worst case is what you care about : if your loop takes 0.01ms 99.9% of the time, but the last 0.1% requires 1 second, it will not work.
If a loop is what best fits an algorithm (in terms of algorithm efficiency, clarity, maintainability), it is likely to be either the fastest (or very close to the fastest) method to express your solution. You should only be afraid of loops if you code wasteful loops.
In the end, if your PLC is not powerful enough to do everything that needs to be done even with a well-coded solution, the presence or absence of loops is not the problem or the solution. But assume your PLC is fast. Very fast. I run programs with hundreds of thousands of variables, many loops executing every cycle, and am getting cycle times below 0.1ms on a Raspberry Pi. I never avoid loops for anything that is "loopy" by nature, but my loops never block waiting for something, they never allocate memory or perform processing that will introduce meaningful variation in execution time, and they do "simple" things (where "simple" is read with the understanding that computers are truly very fast nowadays).
The real question is not whether loops are good or bad, but whether you are able to identify when to use them, and know how to code them well. It is not about loops being good or bad, but rather you being good or bad with them.
Also, test. See how many iterations of an empty loop you can perform before your cycle time suffers. Try it with non-empty loops that perform things you care about. You will learn about your true bottleneck quickly, and are likely to stop worrying about loops and focus on code inside the loop.
For benchmarking the efficiency of different algorithms for a simple task and comparing them, the way I find most often is to set a constant number of times to iterate over the task, and measure the time interval spent for each algorithm.
But, if the number of times is set too small, the interval difference among the algorithms will be too small, and may be masked by external factors. If you set the number of times too large, then it will take too much time to execute. So you have to guess the right number of times by trial end error.
Rather than doing it this way, I think it makes more sense to set a constant time interval that you want to run each algorithm, and then measure how many iterations can be made in that interval for each algorithm.
By doing so, the reliability of the benchmark will be more stable. In the conventional way, the benchmark will be more reliable for tasks that take time.
I haven't seen this way of benchmarking. Do people actually do this way, and is there a benchmarking framework for this way of measurement? I am asking this as a non-language-specific question, but if there is such framework, especially for Ruby, please introduce some. Or am I wrong about this idea?
I found this gem: benchmark/ips.
Take a look at perfer:
https://github.com/jruby/perfer
This has a couple of mechanisms including iterations/s. Don't worry that this is a jruby repo, it works on all Ruby implementations and was written as part of GSoC 2012.
This is the first time i ask question here so thanks very much in advance and please forgive my ignorance. And also I've just started to CUDA programming.
Basically, i have a bunch of points, and i want to calculate all the pair-wise distances. Currently my kernel function just holds on one point, and iteratively read in all other points (from global memory), and conduct the calculation. Here's some of my confusions:
I'm using a Tesla M2050 with 448 cores. But my current parallel version (kernel<<<128,16,16>>>) achieves a much higher parallelism (about 600x faster than kernel<<<1,1,1>>>). Is it possibly due to the multithreading thing or pipeline issue, or they actually indicate the same thing?
I want to further improve the performance. So i figure to use shared memory to hold some input points for each multiprocessing block. But the new code is just as fast. What's the possible cause? Could it be related to the fact that i set too many threads?
Or, is it because i have a if-statement in the code? The thing is, i only consider and count the short distances, so i have a statement like (if dist < 200). How much should i worry about this one?
A million thanks!
Bin
Mark Harris has a very good presentation about optimizing CUDA: Optimizing Parallel Reduction in CUDA.
Algorithmic optimizations
Changes to addressing, algorithm cascading
11.84x speedup, combined!
Code optimizations
Loop unrolling
2.54x speedup, combined
Having an extra operations statement, does indeed cause problems although it will be the last thing you want to optimize, if not simply because you need to know the layout of your code before implementing the size assumptions!
The problem you are working on sounds like the famous n-body problem,
see Fast N-Body Simulation with CUDA.
An additional performance increase can be achieved if you can avoid doing a pairwise computation, for example, the elements are too far to have an effect on each-other. This applies to any relationship that can be expressed geometrically, whether it be pairwise costs or a physics simulation with springs. My favorite method is to divide the grid into boxes and, with each element putting itself into a box via division, then only evaluate pairwise relations between between neighboring boxes. This can be called O(n*m).
(1) The GPU runs many more threads in parallel than there are cores. This is because each core is pipelined. Operations take around 20 cycles on compute capability 2.0 (Fermi) architectures. So for each clock cycle, the core starts work on a new operation, returns the finished result of one operation, and move all the other (around 18) operations one more step towards completion. So, to saturate the GPU, you might need something like 448 * 20 threads.
(2) It's probably because your values are getting cached in the L1 and L2 caches.
(3) It depends on how much work you're doing inside the if conditional. The GPU must run all 32 threads in a warp through all the code inside the if even if the condition is true for only a single of those threads. If there is a lot of code in the conditional as compared to the rest of your kernel, and relatively view threads go through that code path, it is likely that you end up with low compute throughput.
I want to compare the performance of two search trees of integers (an AVL tree vs a RedBlack tree). So how should I design/engineer the tests to accomplish this ? For instance, let's consider the insert operation, what steps should I follow in order to state that on average this operation is faster in the RB case ? Should I time inserting just one element (assuming the trees are pre-populated) or should I time a sequence of insertions ? Also what considerations should I take to correctly measure CPU time accurately ?
Thanks in advance.
This is a really broad question, and as such, I don't think you should be hoping for anybody to get on here and give you the one final correct answer regarding how to measure performance. That being said...
First, you should develop a suite of tests. Two popular techniques exist for doing this: monitor a real-world sequence of operations done by an application (so, find some open source application that uses either an AVL or RB tree, and add some code to print out the sequence of operations it performs) or create such a stream of operations analytically (or synthetically) to target any number of cases (the average usage, particular kinds of abnormal or otherwise unusual usage, random usage, etc.). The more of these kinds of traces you get to test, the better.
Once you have your set of traces to test, you need to develop a driver to do the evaluation. The driver should be simple, the same for both AVL and RB trees (I think that in this case, this shouldn't be a problem; both present the same interface to users, differing only in terms of internal implementation details). The driver should be able to reproduce the usage recorded in your trace sets efficiently and cause the traced operations to be carried out on your data structures. One thing I like to do is to include a third "dummy" candidate that does nothing; this way, I can see how much of an influence the processing of traces is exerting on overall performance.
Each trace should be executed many, many times. You can formalize this somewhat (to reduce statistical uncertainty to within known bounds), but a rule of thumb is that the order of your error will shrink according to 1/sqrt(n), where n is the number of trials. In other words, by running each trace 10,000 times instead of 100 times, you will get errors in the average that are 10x smaller. Record all values; things to look for are the mean, median, mode(s), etc. For each run, try to keep the system conditions the same; no other programs running, etc. To help eliminate spurious results due to external factors changing, you can cull the bottom and top 10% of outliers...
Now, simply compare the data sets. Perhaps what you care most about is the average time the trace takes? Perhaps the worst? Maybe what you really care about is consistency; is the standard deviation big or small? You should have enough data to compare the results for a given trace executed on both test structures; and for different traces, it might make more sense to look at different figures (for instance, if you created a synthetic benchmark that should be the worst case for RB trees, you might ask how badly RB and AVL trees did, whereas you might not care about this for another trace representing the best case for AVL trees, etc.)
Timing on the CPU can be a challenge in its own right. You'll need to ensure that the resolution of your timer is sufficient for measuring your events. clock() and gettimeofday() functions - and others - are popular choices for recording the time of events. If your traces finish too quickly, you can get the aggregate time for several trials (so that if your timer supports microsecond timing and your traces finish in 10 microseconds, you can measure 100 executions of the trace instead of 1, and get time values on 10s of milliseconds, which should be accurate).
Another potential pitfall is providing the same execution environment each time. In between trace runs, at the very least, you might consider techniques for ensuring that you start with a clean cache. Either that, or don't time the first execution, or understand that this result might be culled when you eliminate outliers. It might be safer to just reset the cache (by manipulating every element of some large array, for instance in between executions of traces), since code A might benefit from having some of the values in cache while code B might suffer.
These are a few of the things you might consider when doing your own performance evaluation. Other tools - like PAPI and other profilers, for instance - can measure certain events - cache hits/misses, instructions, etc. - and this information can allow for much richer comparisons than simple comparisons of wall-clock run time.
Measuring CPU time accurately can be very tricky depending on your particular programming language, implementation, etc. For example, with Java's JIT compilation, the results can be extremely different depending on how much you've run the code before now!
Can you give more detail about your situation?
Lets say I am going to run process X and see how long it takes.
I am going to save into a database a date I ran this process, and the time it took. I want to know what to put into the DB.
Process X almost always runs under 1500ms, so this is a short process. It usually runs between 500 and 1500ms, quite a range (3x difference).
My question is, how many "runs" should be saved into the DB as a single run?
Every run saved into the DB as its
own row?
5 Runs, averaged, then save that
time?
10 Runs averaged?
20 Runs, remove anything more than 2
std deviations away, and save
everything inside that range?
Does anyone have any good info backing them up on this?
Save the data for every run into its own row. Then later you can use and analyze the data however you like... ie, all you the other options you listed can be performed after the fact. It's not really possible for someone else to draw meaningful conclusions about how to average/analyze the data without knowing more about what's going on.
The fastest run is the one that most accurately times only your code.
All slower runs are slower because of noise introduced by the operating system scheduler.
The variance you experience is going to differ from machine to machine, and even on identical machines, the set of runnable processes will introduce noise.
None of the above. Bran is close though. You should save every measurment. But don't average them. The average (arithmetic mean) can be very misleading in this type of analysis. The reason is that some of your measurments will be much longer than the others. This will happen becuse things can interfere with your process - even on 'clean' test systems. It can also happen becuse your process may not be as deterministic as you might thing.
Some people think that simply taking more samples (running more iterations) and averaging the measurmetns will give them better data. It doesn't. The more you run, the more likelty it is that you will encounter a perturbing event, thus making the average overly high.
A better way to do this is to run as many measurments as you can (time permitting). 100 is not a bad number, but 30-ish can be enough.
Then, sort these by magnitude and graph them. Note that this is not a standard distribution. Compute compute some simple statistics: mean, median, min, max, lower quaertile, upper quartile.
Contrary to some guidance, do not 'throw away' outside vaulues or 'outliers'. These are often the most intersting measurments. For example, you may establish a nice baseline, then look for departures. Understanding these departures will help you fully understand how your process works, how the sytsem affecdts your process, and what can interfere with your process. It will often readily expose bugs.
Depends what kind of data you want. I'd say one line per run initially, then analyze the data, go from there. Maybe store a min/max/average of X runs if you want to consolidate it.
http://en.wikipedia.org/wiki/Sample_size
Bryan is right - you need to investigate more. if your code has that much variance even "most" of the time then you might have a lot of fluctuation in your test environment because of other processes, os paging or other factors. If not it seems that you have code paths doing wildly varying amount of work and coming up with a single number/run data to describe the performance of such a multi-modal system is not going to tell you much. So i'd say isolate your setup as much as possible, run at least 30 trials and get a feel for what your performance curve looks like. Once you have that, you can use that wikipedia page to come up with a number that will tell you how many trials you need to run per code-change to see if the performance has increased/decreased with some level of statistical significance.
While saying, "Save every run," is nice, it might not be practical in your case. However, I do think that storing only the average eliminates too much data. I like storing the average of ten runs, but instead of storing just the average, I'd also store the max and min values, so that I can get a feel for the spread of the data in addition to its center.
The max and min information in particular will tell you how often corner cases arise. Is the 1500ms case a one-in-1000 outlier? Or is it something that recurs on a regular basis?