How to measure performance?

When I do performance tuning, I first work at a high level and try to answer: is this CPU-bound or I/O-bound?
Once I am sure it is CPU-bound, I try to find hotspots by adding some timer code. This works, but I have failed to figure out these issues:
cache misses
thread context-switch effects
Does anyone know how to measure these?

Are you open to a different way of thinking about performance tuning?
It does not start from I/O-bound vs. CPU-bound, hotspots, or timers.
First, think about just one thread. The execution of a thread is much like a tree. There is a main function (the trunk). There are points when subroutines are called (branches). There are terminal instructions (leaves) and blocking calls like I/O (fruit). The total time the program takes is the sum of all the leaves and all the fruit.
What you want to do is prune the tree, making it as light as possible, without killing it.
What many people do is weigh (time) the whole thing, and then weigh parts of it, and so on, and hope to find hotspots (leafy branches) that maybe they could trim.
Another way is: 1) select some leaves or fruit at random; 2) from each one, paint a line along its branch all the way back to the trunk; 3) take note of branches that have more than one line painted on them; 4) ask "Do I need this branch?" If you can prune it, do so. You will eliminate the entire weight of the branch, and you did it without weighing it. Then start over.
That's the idea behind random-pausing.
There are certain kinds of problems it will not find, but it will find most of them quickly, including any that timers can find.
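If you want to automate a rough version of this instead of pausing in a debugger by hand, here is a minimal "poor man's sampler" sketch in Java: it periodically grabs the stack traces of all threads and counts how often each frame appears, which is essentially the painted-branches exercise. The workload thread and the sample counts are made up for illustration; treat it as a sketch, not a profiler.
import java.util.*;

// Sample all thread stacks a few times per second and count how often each
// stack frame shows up. Frames that appear in many samples are the heavy
// branches worth questioning.
public class PoorMansSampler {
    public static void main(String[] args) throws InterruptedException {
        Map<String, Integer> frameCounts = new HashMap<>();
        Random rnd = new Random();
        startWorkload();                                   // hypothetical: the code under test
        for (int sample = 0; sample < 200; sample++) {
            for (StackTraceElement[] stack : Thread.getAllStackTraces().values()) {
                for (StackTraceElement frame : stack) {
                    frameCounts.merge(frame.toString(), 1, Integer::sum);
                }
            }
            Thread.sleep(10 + rnd.nextInt(20));            // jitter so we don't sync with the workload
        }
        frameCounts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(20)
                .forEach(e -> System.out.println(e.getValue() + "  " + e.getKey()));
    }

    private static void startWorkload() {
        Thread t = new Thread(PoorMansSampler::busyWork, "workload");
        t.setDaemon(true);                                 // let the JVM exit when sampling is done
        t.start();
    }

    private static void busyWork() {
        double x = 0;
        while (true) x += Math.sin(x) * Math.cos(x);       // something to sample against
    }
}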

1) Use cachegrind/callgrind/kcachegrind
http://valgrind.org/info/tools.html#cachegrind
Pretty useful for analysing memory locality under specific sets of assumptions (the traversal sketch after this list shows the kind of effect it measures).
2) Threading is really painful to profile correctly. Experiment with cpusets and process affinities; on modern NUMA systems this becomes critical quickly.
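To get a feel for the kind of effect cachegrind measures, here is a small self-contained illustration (written in Java only for convenience; the effect is the same in C): the same summation done in row-major and then column-major order over a row-major array. The strided traversal is usually several times slower, and almost all of that difference is cache misses.
// Same arithmetic, two traversal orders. Row-major walks memory sequentially;
// column-major strides by COLS elements and misses the cache far more often.
public class LocalityDemo {
    static final int ROWS = 4096, COLS = 4096;

    public static void main(String[] args) {
        double[] m = new double[ROWS * COLS];              // matrix stored row-major in one array
        java.util.Arrays.fill(m, 1.0);

        long t0 = System.nanoTime();
        double sumRowMajor = 0;
        for (int r = 0; r < ROWS; r++)
            for (int c = 0; c < COLS; c++)
                sumRowMajor += m[r * COLS + c];            // sequential access
        long t1 = System.nanoTime();

        double sumColMajor = 0;
        for (int c = 0; c < COLS; c++)
            for (int r = 0; r < ROWS; r++)
                sumColMajor += m[r * COLS + c];            // strided access
        long t2 = System.nanoTime();

        System.out.printf("row-major: %d ms, col-major: %d ms (sums %.0f / %.0f)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, sumRowMajor, sumColMajor);
    }
}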

Related

Developing cluster apps

I'm not sure exactly where (or even how exactly to ask) this question, so I'm hoping someone here can point me in the right direction.
I have a service that I'm building. That service has different objects in memory, each with its own state. Whenever an object is created it loads its state from the database and holds it; when changes are made to the object, they are also persisted to the database.
I would like to scale this service. I have looked at solutions such as akka.net (the actor model), which has a clustering option. From what I've read, it synchronizes state with something called "gossip", where each node sends its state to the other nodes. I'm not sure it is really possible to convert my working application to akka.net at this point.
I'm wondering how clusters keep state synced between different nodes (I get the gossip concept). What happens if machine A receives a message and, at the same time, machine B also receives a message, and both change the same object's state? That would cause data-integrity problems between the copies. My only thought is to lock a shared resource, but that defeats the purpose of the cluster.
Keeping state in the database is also not an option since the database becomes a bottleneck and a single point of failure.
I can't seem to find any relevant reading materials online - but I'm also lacking the technical phrases I need to focus on.
In case it's relevant, I'm using .NET Core and C# for development.
Can anyone explain the concept of clustering, how it works and how nodes are kept in sync, or point me in the right direction?
You have a big problem. I think that the way you are thinking about the problem is a bigger problem. Let's go through some basics.
Clustering is used to solve big problems, much like the "eat an elephant" problem. To solve that problem you could design a single, bigger predator with a huge mouth, but history and paleontology have shown us that big predators are not easily sustained (they are expensive for their environment).
So to solve your problem, you could take a bigger stronger server.
Or, you could use clustering.
Clustering solves the "eat the elephant" problem in a very different way. Instead of sending one huge predator with a huge mouth to eat the elephant, it uses distributed, shared processing to eat it one bite at a time. Done properly, ants could eat the elephant, if there are enough of them and the circumstances are right.
But notice that in my example the ants are very small. A single ant will never carry the entire elephant. They could carry the entire elephant if they all worked together, but then you run into concurrency and locking problems (you must coordinate the ants).
Ants have shown us a much better way to deal with this: they take a piece of the elephant and deal with the problem in smaller chunks.
In your system you ask how you can sync data across nodes... My question would be why? If you are syncing data then you are mirroring and your problem becomes even bigger (you are cloning the elephant but can only eat the original).
The solution to your problem is to rethink the solution and see if you can break down the problem into smaller pieces.
In Akka, and in the actor pattern generally, the best way to deal with problems is to use small "processes" (single ants). While one such process on its own is almost useless, at large scale they can become very powerful. When the architecture is done properly you will notice that taking a flamethrower to ants will not defeat them... more ants will come, and they will keep working on the problem.
Copying and syncing data is not your solution; clustering is. You must break your data down to the point where you can give a piece of it to a single ant. If you can do this, then you can use Akka. If this approach seems ludicrous, then Akka is not for you.
But consider this... You obviously have concerns over your database backend - you don't want to increase I/O and introduce a single point of failure - and I would have to agree with you. But you need to rethink things. You could use database mirroring to remove the single point of failure, but you are correct that this won't remove the bottleneck. So let's say that mirroring removes the single point of failure... now let's attack the bottleneck.
If you can split your data into chunks small enough for ants to handle, then I would urge you to tell your ants to report to the database only when the data changes. You can read it once on initialization (you do need a backing store - don't kid yourself, power can be lost in an instant, so the state must be saved somewhere), but if you tell your ants to persist only changed data, you remove almost all the reads from the equation, which drastically shifts where the load comes from. Once you only have updates, inserts and deletes to deal with, the whole landscape becomes much simpler; a sketch of that ownership-plus-persist-on-change shape follows below.
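To make the "one ant owns one piece" idea concrete, here is a minimal sketch, in Java only for brevity (the shape is the same in C#). The Database interface, the node list and the state map are hypothetical stand-ins for whatever you already have; the point is that each object id has exactly one owner node, state is loaded once, and writes happen only when something actually changed.
import java.util.*;

// Each object id is owned by exactly one node (picked by hashing the id),
// the owner loads state once, and writes back only when the state changes.
class OwnedObject {
    private final String id;
    private final Map<String, String> state;    // whatever your per-object state is
    private boolean dirty = false;

    OwnedObject(String id, Database db) {
        this.id = id;
        this.state = db.load(id);               // read once, on creation
    }

    void handle(String key, String value, Database db) {
        if (!value.equals(state.get(key))) {    // only mark dirty on a real change
            state.put(key, value);
            dirty = true;
        }
        if (dirty) {                            // persist only changed state
            db.save(id, state);
            dirty = false;
        }
    }
}

class Cluster {
    private final List<String> nodes;           // e.g. ["node-0", "node-1", "node-2"]
    Cluster(List<String> nodes) { this.nodes = nodes; }

    // Every message for a given object id is routed to the same owner node,
    // so no two nodes ever mutate the same object and nothing needs syncing.
    String ownerOf(String objectId) {
        return nodes.get(Math.floorMod(objectId.hashCode(), nodes.size()));
    }
}

interface Database {                            // hypothetical persistence boundary
    Map<String, String> load(String id);
    void save(String id, Map<String, String> state);
}
Note that plain modulo routing reshuffles ownership whenever the node list changes; real clusters use consistent hashing or a shard coordinator for that, but the ownership principle is the same.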
Clustering should be the solution for you, but only if you can take the concept of mirror away from your mind.
Cluster nodes can and will crash... but they can be respawned on other nodes, so you always have a responsive system. Only when you deal with a crash or the loss of a node/worker process/ant will you have to reload data...
Good luck... you have outlined a formidable problem that I have seen people with software engineering degrees fail at solving.

Is there a parallel flood fill implementation?

I've got OpenMP and MPI at my disposal, and was wondering if anyone has come across a parallel version of any flood fill algorithm (preferably in C). If not, I'd be interested in sketches of how to parallelise it - is it even possible, given that it's based on recursion?
Wikipedia's got a pretty good article if you need to refresh your memory on flood fills.
Many thanks for your help.
There's nothing inherently recursive about flood fill; to do some work you just need some information about previously discovered "frontier" cells. If you think of it that way, it's clear that parallelism is eminently possible: even with a single queue, you could use four threads (one for each direction), and only move the tail of the queue when the cell has been examined by each thread - or, equivalently, four queues. Thinking in this way, one might even imagine partitioning the space into multiple queues, bucketed by coordinate ranges, perhaps.
One basic problem is that the problem definition usually includes the proviso that no cell is ever revisited. This implies that each worker needs an up-to-date map of which cells have been considered (globally). Mutable global information is problematic, performance-wise, though it's not hard to think of ways to limit the need to propagate updates globally.
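To make that concrete, here is a rough sketch of the shared-frontier approach, written in Java for brevity (the same structure maps onto a work pool or task queue in C with OpenMP). The atomic visited array is what enforces the "no cell revisited" rule: whichever worker wins the compare-and-set owns the cell. A pending counter tells the workers when the frontier is genuinely exhausted rather than just momentarily empty.
import java.util.concurrent.*;
import java.util.concurrent.atomic.*;

// Queue-based parallel flood fill over a width*height grid stored row-major.
// Recolours the connected region of "target" cells around (startX, startY).
public class ParallelFloodFill {
    public static void fill(int[] image, int width, int height,
                            int startX, int startY,
                            int target, int replacement, int threads)
            throws InterruptedException {
        if (target == replacement) return;                 // nothing to do, and avoids an endless frontier
        AtomicIntegerArray visited = new AtomicIntegerArray(width * height);
        ConcurrentLinkedQueue<int[]> frontier = new ConcurrentLinkedQueue<>();
        AtomicInteger pending = new AtomicInteger(1);      // queued-but-unfinished cells
        frontier.add(new int[] { startX, startY });

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                while (pending.get() > 0) {
                    int[] cell = frontier.poll();
                    if (cell == null) continue;            // queue momentarily empty; others may still add
                    int x = cell[0], y = cell[1];
                    if (x >= 0 && x < width && y >= 0 && y < height) {
                        int idx = y * width + x;
                        if (image[idx] == target && visited.compareAndSet(idx, 0, 1)) {
                            image[idx] = replacement;      // we won the cell; recolour it
                            int[][] nbrs = { {x + 1, y}, {x - 1, y}, {x, y + 1}, {x, y - 1} };
                            for (int[] n : nbrs) {
                                pending.incrementAndGet(); // count the neighbour before queuing it
                                frontier.add(n);
                            }
                        }
                    }
                    pending.decrementAndGet();             // this cell is fully processed
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
    }
}
The recursion in the textbook version is just an implicit stack; once it is an explicit queue, the interesting design choice is how to partition that queue (by coordinate ranges, or one queue per worker with stealing) to cut down contention.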

When timing how long a quick process runs, how many runs should be used?

Let's say I am going to run process X and see how long it takes.
I am going to save into a database the date I ran this process and the time it took, and I want to know what to put into the DB.
Process X almost always runs under 1500ms, so this is a short process. It usually runs between 500 and 1500ms, quite a range (3x difference).
My question is, how many "runs" should be saved into the DB as a single run?
Every run saved into the DB as its own row?
5 runs, averaged, then save that time?
10 runs, averaged?
20 runs, remove anything more than 2 std deviations away, and save everything inside that range?
Does anyone have any good info backing them up on this?
Save the data for every run into its own row. Then later you can use and analyze the data however you like; i.e., all the other options you listed can be computed after the fact. It's not really possible for someone else to draw meaningful conclusions about how to average/analyze the data without knowing more about what's going on.
The fastest run is the one that most accurately times only your code.
All slower runs are slower because of noise introduced by the operating system scheduler.
The variance you experience is going to differ from machine to machine, and even on identical machines, the set of runnable processes will introduce noise.
None of the above. Bryan is close, though. You should save every measurement, but don't average them. The average (arithmetic mean) can be very misleading in this type of analysis, because some of your measurements will be much longer than the others. This happens because things can interfere with your process - even on 'clean' test systems. It can also happen because your process may not be as deterministic as you might think.
Some people think that simply taking more samples (running more iterations) and averaging the measurements will give them better data. It doesn't. The more you run, the more likely it is that you will encounter a perturbing event, making the average too high.
A better way is to take as many measurements as you can (time permitting). 100 is not a bad number, but 30-ish can be enough.
Then sort these by magnitude and graph them. Note that this is not a normal distribution. Compute some simple statistics: mean, median, min, max, lower quartile, upper quartile.
Contrary to some guidance, do not 'throw away' outside values or 'outliers'. These are often the most interesting measurements. For example, you may establish a nice baseline, then look for departures. Understanding these departures will help you fully understand how your process works, how the system affects your process, and what can interfere with it. It will often readily expose bugs.
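A minimal sketch of that post-processing, assuming the raw per-run timings (in milliseconds) have been pulled back out of the database:
import java.util.Arrays;

// Summarise raw per-run timings: sort, then read off min/max, median and
// quartiles. Nothing is discarded - outliers stay in and get reported.
public class RunStats {
    public static void summarise(double[] runsMs) {
        double[] sorted = runsMs.clone();
        Arrays.sort(sorted);
        double mean = Arrays.stream(sorted).average().orElse(Double.NaN);
        System.out.printf("n=%d min=%.0f q1=%.0f median=%.0f q3=%.0f max=%.0f mean=%.1f%n",
                sorted.length, sorted[0],
                percentile(sorted, 25), percentile(sorted, 50), percentile(sorted, 75),
                sorted[sorted.length - 1], mean);
    }

    // Nearest-rank percentile on an already-sorted array.
    static double percentile(double[] sorted, double p) {
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        // example timings pulled from the DB
        summarise(new double[] { 512, 525, 530, 540, 545, 560, 610, 700, 900, 1480 });
    }
}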
Depends what kind of data you want. I'd say one line per run initially, then analyze the data, go from there. Maybe store a min/max/average of X runs if you want to consolidate it.
http://en.wikipedia.org/wiki/Sample_size
Bryan is right - you need to investigate more. If your code has that much variance even "most" of the time, then you might have a lot of fluctuation in your test environment because of other processes, OS paging or other factors. If not, it seems that you have code paths doing wildly varying amounts of work, and coming up with a single number per run to describe the performance of such a multi-modal system is not going to tell you much. So I'd say isolate your setup as much as possible, run at least 30 trials and get a feel for what your performance curve looks like. Once you have that, you can use that Wikipedia page to come up with a number that will tell you how many trials you need to run per code change to see if the performance has increased or decreased with some level of statistical significance.
While saying, "Save every run," is nice, it might not be practical in your case. However, I do think that storing only the average eliminates too much data. I like storing the average of ten runs, but instead of storing just the average, I'd also store the max and min values, so that I can get a feel for the spread of the data in addition to its center.
The max and min information in particular will tell you how often corner cases arise. Is the 1500ms case a one-in-1000 outlier? Or is it something that recurs on a regular basis?

Chess Optimizations

OK, so I have been working on my chess program for a while and I am beginning to hit a wall. I have done all of the standard optimizations (negascout, iterative deepening, killer moves, history heuristic, quiescence search, pawn position evaluation, some search extensions) and I'm all out of ideas!
I am looking to make it multi-threaded soon, and that should give me a good boost in performance, but aside from that, are there any other nifty tricks you have come across? I have considered switching to MTD(f), but I have heard it is a hassle and isn't really worth it.
What I would be most interested in is some kind of learning algorithm, but I don't know if anyone has done that effectively with a chess program yet.
Also, would switching to a bitboard representation be significant? I am currently using 0x88.
Over the last year of development of my chess engine (www.chessbin.com), much of the time has been spent optimizing my code to allow for better and faster move searching. Over that time I have learned a few tricks that I would like to share with you.
Measuring Performance
Essentially you can improve your performance in two ways:
Evaluate your nodes faster
Search fewer nodes to come up with the same answer
Your first problem in code optimization will be measurement. How do you know you have really made a difference? In order to help you with this problem you will need to make sure you can record some statistics during your move search. The ones I capture in my chess engine are:
Time it took for the search to complete
Number of nodes searched
This will allow you to benchmark and test your changes. The best way to approach testing is to create several save games from the opening position, middle game and the end game. Record the time and number of nodes searched for black and white.
After making any changes I usually perform tests against the above-mentioned save games to see if I have made improvements in those two metrics: number of nodes searched, or speed.
To complicate things further, after making a code change you might run your engine 3 times and get 3 different results each time. Let's say that your chess engine found the best move in 9, 10 and 11 seconds. That is a spread of about 20%. So did you improve your engine by 10%-20%, or was it just varying load on your PC? How do you know? To fight this I have added methods that allow my engine to play against itself, making moves for both white and black. This way you can test not just the time variance over one move, but over a series of as many as 50 moves during the course of a game. If last time the game took 10 minutes and now it takes 9, you probably improved your engine by 10%. Running the test again should confirm this.
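A sketch of the kind of harness this describes; Engine, loadFen, search and nodesSearched are hypothetical names standing in for whatever your own engine exposes:
// Run a fixed set of saved positions to a fixed depth and record elapsed
// time and nodes searched for each, plus the totals.
public class SearchBenchmark {
    public static void run(Engine engine, String[] fens, int depth) {
        long totalNodes = 0, totalMs = 0;
        for (String fen : fens) {
            engine.loadFen(fen);
            long start = System.nanoTime();
            SearchResult r = engine.search(depth);            // returns best move plus node count
            long ms = (System.nanoTime() - start) / 1_000_000;
            totalNodes += r.nodesSearched();
            totalMs += ms;
            System.out.printf("%-60s %10d nodes %6d ms%n", fen, r.nodesSearched(), ms);
        }
        System.out.printf("TOTAL %d nodes in %d ms%n", totalNodes, totalMs);
    }

    interface Engine { void loadFen(String fen); SearchResult search(int depth); }
    interface SearchResult { long nodesSearched(); }
}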
Finding Performance Gains
Now that we know how to measure performance gains, let's discuss how to identify potential ones.
If you are in a .NET environment then the .NET profiler will be your friend. It comes built in with the developer editions of Visual Studio, and there are other third-party tools you can use as well. This tool has saved me hours of work, as it will tell you where your engine is spending most of its time and let you concentrate on your trouble spots. If you do not have a profiler you may have to log timestamps as your engine goes through different steps; I do not suggest this. In this case a good profiler is worth its weight in gold. Red Gate ANTS Profiler is expensive but the best one I have ever tried; if you can't afford it, at least use the 14-day trial.
Your profiler will surely identify things for you; however, here are some small lessons I have learned working with C#:
Make everything private.
Whatever you can't make private, make it sealed.
Make as many methods static as possible.
Don't make your methods chatty; one long method is better than 4 smaller ones.
A chess board stored as an array [8][8] is slower than an array of [64].
Replace int with byte where possible.
Return from your methods as early as possible.
Stacks are better than lists.
Arrays are better than stacks and lists.
If you can, define the size of the list before you populate it.
Casting, boxing and un-boxing are evil.
Further Performance Gains:
I find move generation and ordering is extremely important. However here is the problem as I see it. If you evaluate the score of each move before you sort and run Alpha Beta, you will be able to optimize your move ordering such that you will get extremely quick Alpha Beta cutoffs. This is because you will be able to mostly try the best move first.
However, the time you spent evaluating each move may be wasted. For example, you might evaluate the score of 20 moves, sort them, try the first 2 and receive a cut-off on move number 2. In theory, the time you spent on the other 18 moves was wasted.
On the other hand, if you do a lighter and much faster evaluation - say, just captures - your sort will not be that good and you will have to search more nodes (up to 60% more), but you would not do a heavy evaluation on every possible move. As a whole, this approach is usually faster.
Finding this perfect balance between having enough information for a good sort and not doing extra work on moves you will not use will allow you to find huge gains in your search algorithm. Furthermore, if you choose the cheaper sort approach, you will want to first do a shallower search, say to ply 3, and sort your moves before you go into the deeper search (this is often called iterative deepening). This will significantly improve your sort and allow you to search far fewer moves.
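A rough sketch of that iterative-deepening shape at the root, where each shallow pass orders the moves for the next, deeper pass. Board, Move, alphaBeta and INFINITY are stand-ins for your engine's own types, so treat this as Java-flavoured pseudocode rather than a drop-in implementation:
// Each pass produces scores that order the moves for the next, deeper pass,
// so the likely-best move is tried first and alpha-beta cuts off early.
Move findBestMove(Board board, int maxDepth) {
    List<Move> moves = board.generateMoves();
    Move best = moves.get(0);
    for (int depth = 1; depth <= maxDepth; depth++) {
        for (Move m : moves) {
            board.makeMove(m);
            m.score = -alphaBeta(board, depth - 1, -INFINITY, +INFINITY);
            board.undoMove(m);
        }
        moves.sort((a, b) -> Integer.compare(b.score, a.score));  // best first for the next pass
        best = moves.get(0);
        // a real engine would also check its time budget here and stop early
    }
    return best;
}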
Answering an old question.
Assuming you already have a working transposition table.
Late Move Reduction: that gave my program about 100 Elo points, and it is very simple to implement.
In my experience, unless your implementation is very inefficient, the actual board representation (0x88, bitboard, etc.) is not that important.
Although you can cripple your chess engine with bad performance, a lightning-fast move generator in itself is not going to make a program good.
The search tricks used and the evaluation function are the overwhelming factors determining overall strength.
And the most important parts, by far, of the evaluation are Material, Passed pawns, King Safety and Pawn Structure.
The most important parts of the search are: Null Move Pruning, Check Extension and Late Move reduction.
Your program can come a long, long way, on these simple techniques alone!
Good move ordering!
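Since Late Move Reduction is mentioned above as cheap to add, here is a rough sketch of where it sits inside a negamax search. The Board/Move API and quiescence() are hypothetical, and the exact thresholds (which move number, which depth, how much to reduce) are tuning parameters rather than fixed rules:
// Late, quiet moves get searched one ply shallower with a null window; only
// if they surprise us by beating alpha do we pay for the full-depth re-search.
int search(Board board, int depth, int alpha, int beta) {
    if (depth <= 0) return quiescence(board, alpha, beta);
    int moveNumber = 0;
    for (Move m : generateOrderedMoves(board)) {             // best guesses first
        board.makeMove(m);
        boolean quiet = !m.isCapture() && !m.isPromotion() && !board.inCheck();
        int score;
        if (moveNumber >= 4 && depth >= 3 && quiet) {
            score = -search(board, depth - 2, -alpha - 1, -alpha);   // reduced, null-window probe
            if (score > alpha)
                score = -search(board, depth - 1, -beta, -alpha);    // re-search at full depth
        } else {
            score = -search(board, depth - 1, -beta, -alpha);
        }
        board.undoMove(m);
        if (score >= beta) return beta;                      // cutoff
        if (score > alpha) alpha = score;
        moveNumber++;
    }
    return alpha;
}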
An old question, but the same techniques apply now as five years ago. Aren't we all writing our own chess engines? I have my own, called "Norwegian Gambit", that I hope will eventually compete with other Java engines on the CCRL. Like many others I use Stockfish for ideas, since it is so nicely written and open. Their testing framework Fishtest and its community also give a ton of good advice. It is worth comparing your evaluation scores with what Stockfish gets, since how to evaluate is probably still the biggest unknown in chess programming, and Stockfish has moved away from many traditional evaluation terms that have become urban legends (like the double-bishop bonus). The biggest difference, however, came after I implemented the same techniques you mention - negascout, TT, LMR - and started using Stockfish for comparison: I noticed that for the same depth Stockfish searched far fewer moves than I did, because of its move ordering.
Move ordering essentials
The one thing that is easily forgotten is good move-ordering. For the Alpha Beta cutoff to be efficient it is essential to get the best moves first. On the other hand it can also be time-consuming so it is essential to do it only as necessary.
Transposition table
Sort promotions and good captures by their gain
Killer moves
Moves that result in check on opponent
History heuristics
Silent moves - sort by PSQT value
The sorting should be done as needed, usually it is enough to sort the captures, and thereafter you could run the more expensive sorting of checks and PSQT only if needed.
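One way that priority list turns into code is a single ordering score per move, sorted descending before the search loop. The Move accessors and the ttMove/killers/history/psqt tables below are hypothetical names for whatever your engine already keeps:
// Higher buckets dominate lower ones: hash move, then winning captures and
// promotions, then killers, checks, history, and finally quiet moves by PSQT.
int orderingScore(Move m, Move ttMove, Move[] killers, int[][] history, int[][] psqt) {
    if (m.equals(ttMove))                 return 1_000_000;                  // transposition-table move
    if (m.isCapture() || m.isPromotion()) return 100_000 + m.materialGain(); // e.g. victim value minus attacker value
    for (Move k : killers)
        if (m.equals(k))                  return 90_000;                     // killer at this ply
    if (m.givesCheck())                   return 50_000;                     // checking move
    int h = history[m.from()][m.to()];
    if (h > 0)                            return 10_000 + Math.min(h, 30_000); // history heuristic, capped to its bucket
    return psqt[m.piece()][m.to()] - psqt[m.piece()][m.from()];              // silent move: PSQT delta
}
As noted above, you do not have to compute all of this eagerly: scoring the hash move and captures up front and deferring the check/PSQT buckets until those moves are actually reached keeps the sort cheap.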
About Java/C# vs C/C++/Assembly
Programming techniques are the same for Java as in the excellent answer by Adam Berent, who used C#. In addition to his list I would mention avoiding object arrays - rather use many arrays of primitives - but, contrary to his suggestion of using bytes, I find that with 64-bit Java there is little to be saved by using byte or int instead of 64-bit long. I have also gone down the path of rewriting to C/C++/assembly, and I got no performance gain whatsoever. I used assembly code for bitscan instructions such as LZCNT and POPCNT, but later I found that Java 8 also uses those for the corresponding methods on the Long class. To my surprise Java is faster; the Java 8 virtual machine seems to do a better job of optimizing than a C compiler can.
I know one improvement that was talked about in AI courses at university was having a huge database of finishing moves - that is, a precalculated database for positions with only a small number of pieces left. If you hit a near-endgame position in your search, you stop the search and take the precalculated value. That improves your search results, like extra deepening that you can do for important/critical lines without spending much computation time. I think it also comes with a change of heuristics in the late game state, but I'm not a chess player, so I don't know the dynamics of game finishing.
Be warned, getting game search right in a threaded environment can be a royal pain (I've tried it). It can be done, but from some literature searching I did a while back, it's extremely hard to get any speed boost at all out of it.
It's quite an old question; I was just searching questions on chess and found this one unanswered. Well, it may not be of any help to you now, but it may prove helpful to other users.
I didn't see null-move pruning or transposition tables mentioned - are you using them? They would give you a big boost.
One thing that gave me a big boost was minimizing conditional branching. A lot of things can be precomputed; search for such opportunities.
Most modern PCs have multiple cores, so it would be a good idea to make it multithreaded. You don't necessarily need to go to MTD(f) for that.
I won't suggest moving your code to bitboards. It's simply too much work, even though bitboards could give a boost on 64-bit machines.
Finally, and most importantly, the chess literature dominates any optimizations we might invent ourselves - working them out from scratch is too much work. Look at open-source chess engines, particularly Crafty and Fruit/Toga. Fruit was open source initially.
Late answer, but this may help someone:
Given all the optimizations you mentioned, 1450 Elo is very low. My guess is that something is very wrong with your code. Did you:
Write a perft routine and run it through a set of positions? All tests should pass, so you know your move generator is free of bugs. If you don't have this, there's no point in talking about Elo.
Write a mirrorBoard routine and run the evaluation code through a set of positions? The result should be the same for the normal and mirrored positions; otherwise you have a bug in your eval.
Do you have a hashtable (aka transposition table)? If not, this is a must. It will help while searching and ordering moves, giving a brutal difference in speed.
How do you implement move ordering? This links back to point 3.
Did you implement the UCI protocol? Is your move-parsing function working properly? I had a bug like this in my engine:
/* Parses a UCI move string and returns a Board object */
Board parseUCIMoves(String moves) { // e.g. "e2e4 c7c5 g1f3 ..."
    //...
    if (someMove.equals("e1g1") || someMove.equals("e1c1")) {
        //apply proper castle
    }
    //...
}
Sometimes the engine crashed while playing a match, and I thought it was the GUI's fault, since all perft tests were OK. It took me a week to find the bug, and even then only by luck. So test everything.
For (1) you can search every position to depth 6. I use a file with ~1000 positions. See here: https://chessprogramming.wikispaces.com/Perft
For (2) you just need a file with millions of positions (just the FEN string).
Given all the above and a very basic evaluation function (material, piece square tables, passed pawns, king safety) it should play at +-2000 ELO.
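For reference, perft itself is tiny; the value is in comparing its node counts for standard test positions against published tables, since any mismatch pins down a move-generation or make/undo bug. Board, Move, generateLegalMoves, makeMove and undoMove are stand-ins for your engine's API:
// Count leaf nodes of the legal-move tree to a fixed depth.
long perft(Board board, int depth) {
    if (depth == 0) return 1;
    long nodes = 0;
    for (Move m : board.generateLegalMoves()) {
        board.makeMove(m);
        nodes += perft(board, depth - 1);
        board.undoMove(m);
    }
    return nodes;
}
// From the initial position: perft(1) = 20, perft(2) = 400, perft(3) = 8902.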
As far as tips, I know large gains can be found in optimizing your move generation routines before any eval functions. Making that function as tight as possible can give you 10% or more in nodes/sec improvement.
If you're moving to bitboards, do some digging in the rec.games.chess.computer archives for some of Dr. Robert Hyatt's old posts about Crafty (I'm pretty sure he doesn't post any more), or grab the latest copy from his FTP site and start digging. I'm pretty sure it would be a significant shift for you, though.
Transposition Table
Opening Book
End Game Table Bases
Improved Static Board Evaluation for Leaf Nodes
Bitboards for Raw Speed
Profile and benchmark. Theoretical optimizations are great, but unless you are measuring the performance impact of every change you make, you won't know whether your work is improving or worsening the speed of the final code.
Try to limit the penalty to yourself for trying different algorithms. Make it easy to test various implementations of algorithms against one another. i.e. Make it easy to build a PVS version of your code as well as a NegaScout version.
Find the hot spots. Refactor. Rewrite in assembly if necessary. Repeat.
Assuming "history heuristic" involves some sort of database of past moves, a learning algorithm isn't going to give you much more unless it plays a lot of games against the same player. You can probably achieve more by classifying a player and tweaking the selection of moves from your historic database.
It's been a long time since I've done any programming on a chess program, but at the time bitboards did give a real improvement. Other than that I can't give you much advice. Do you only evaluate the position of pawns? Some (slight) bonuses for the position or mobility of some key pieces may be in order.
I'm not certain what type of thing you would like it to learn however...

How to detect anomalous resource consumption reliably?

This question is about a whole class of similar problems, but I'll ask it as a concrete example.
I have a server with a file system whose contents fluctuate. I need to monitor the available space on this file system to ensure that it doesn't fill up. For the sake of argument, let's suppose that if it fills up, the server goes down.
It doesn't really matter what it is -- it might, for example, be a queue of "work".
During "normal" operation, the available space varies within "normal" limits, but there may be pathologies:
Some other (possibly external) component that adds work may run out of control
Some component that removes work seizes up, but remains undetected
The statistical characteristics of the process are basically unknown.
What I'm looking for is an algorithm that takes, as input, timed periodic measurements of the available space (alternative suggestions for input are welcome), and produces as output, an alarm when things are "abnormal" and the file system is "likely to fill up". It is obviously important to avoid false negatives, but almost as important to avoid false positives, to avoid numbing the brain of the sysadmin who gets the alarm.
I appreciate that there are alternative solutions like throwing more storage space at the underlying problem, but I have actually experienced instances where 1000 times wasn't enough.
Algorithms which consider stored historical measurements are fine, although on-the-fly algorithms which minimise the amount of historic data are preferred.
I have accepted Frank's answer, and am now going back to the drawing-board to study his references in depth.
There are three cases, I think, of interest, not in order:
The "Harrods' Sale has just started" scenario: a peak of activity that at one-second resolution is "off the dial", but doesn't represent a real danger of resource depletion;
The "Global Warming" scenario: needing to plan for (relatively) stable growth; and
The "Google is sending me an unsolicited copy of The Index" scenario: this will deplete all my resources in relatively short order unless I do something to stop it.
It's the last one that is (I think) the most interesting and challenging from a sysadmin's point of view.
If it is actually related to a queue of work, then queueing theory may be the best route to an answer.
For the general case you could perhaps attempt a (multiple?) linear regression on the historical data, to detect whether there is a statistically significant rising trend in resource usage that is likely to lead to problems if it continues. With this technique you may also be able to predict how long it can continue before it leads to problems: just set a threshold for "problem" and use the slope of the trend to determine how long it will take to get there. You would have to play around with this, and with the variables you collect, to see whether there is any statistically significant relationship you can discover in the first place.
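A minimal sketch of that idea, assuming free space is sampled at a fixed interval: fit a least-squares line to the recent samples and, if the slope is negative, project when the fitted line crosses zero. The sample data in main is invented for illustration.
// Fit free = a + b*t by least squares over the recent samples, then estimate
// how long until the fitted line reaches zero free space.
public class TrendAlarm {
    // times in seconds and free space in bytes, same length, oldest first
    public static double secondsUntilFull(double[] t, double[] free) {
        int n = t.length;
        double meanT = 0, meanF = 0;
        for (int i = 0; i < n; i++) { meanT += t[i]; meanF += free[i]; }
        meanT /= n; meanF /= n;
        double cov = 0, var = 0;
        for (int i = 0; i < n; i++) {
            cov += (t[i] - meanT) * (free[i] - meanF);
            var += (t[i] - meanT) * (t[i] - meanT);
        }
        double slope = cov / var;                        // bytes per second; negative means shrinking
        if (slope >= 0) return Double.POSITIVE_INFINITY; // no downward trend right now
        double intercept = meanF - slope * meanT;
        double tZero = -intercept / slope;               // time at which the fitted line hits zero
        return tZero - t[n - 1];                         // seconds from the latest sample
    }

    public static void main(String[] args) {
        double[] t    = { 0, 60, 120, 180, 240, 300 };
        double[] free = { 5.0e9, 4.8e9, 4.7e9, 4.4e9, 4.2e9, 4.0e9 };
        System.out.printf("estimated seconds until full: %.0f%n", secondsUntilFull(t, free));
    }
}
This only captures the steady "Global Warming" case; short harmless spikes are absorbed by fitting over a long enough window, while the sudden runaway case is better caught by a second fit over a much shorter window with a steeper slope threshold.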
Although it covers a completely different topic (global warming), I've found tamino's blog (tamino.wordpress.com) to be a very good resource on statistical analysis of data that is full of knowns and unknowns. For example, see this post.
edit: as per my comment I think the problem is somewhat analogous to the GW problem. You have short term bursts of activity which average out to zero, and long term trends superimposed that you are interested in. Also there is probably more than one long term trend, and it changes from time to time. Tamino describes a technique which may be suitable for this, but unfortunately I cannot find the post I'm thinking of. It involves sliding regressions along the data (imagine multiple lines fitted to noisy data), and letting the data pick the inflection points. If you could do this then you could perhaps identify a significant change in the trend. Unfortunately it may only be identifiable after the fact, as you may need to accumulate a lot of data to get significance. But it might still be in time to head off resource depletion. At least it may give you a robust way to determine what kind of safety margin and resources in reserve you need in future.
