How to visualize profile files graphically? - go

I'm developing with Go 1.2 on Windows 8.1 64-bit. I had many issues getting the go pprof tool to work properly, such as memory addresses being displayed instead of actual function names.
However, I found profile, which seems to do a great job at producing profile files that work with the pprof tool. My question is, how do I use those profile files for graphical visualization?

You can try go tool pprof /path/to/program profile.prof to fix the problem of seeing raw addresses instead of function names.
If you want graphical visualization, type web at the pprof prompt.
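For reference, if you would rather produce the profile with the standard library instead of the third-party profile package, a minimal sketch looks like this (the file name cpu.prof is just a placeholder):

package main

import (
    "log"
    "os"
    "runtime/pprof"
)

func main() {
    // Write the CPU profile to a file that go tool pprof can read later.
    f, err := os.Create("cpu.prof")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    if err := pprof.StartCPUProfile(f); err != nil {
        log.Fatal(err)
    }
    defer pprof.StopCPUProfile()

    // ... the code you want to profile goes here ...
}

Then run go tool pprof yourprogram.exe cpu.prof and type web at the (pprof) prompt to get an SVG call graph; the web command needs Graphviz (dot) installed and on your PATH.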

If your goal is to see pretty but basically meaningless pictures, go for visualization as @Specode suggested.
If your goal is speed, then I recommend you forget visualization.
Visualization does not tell you what you need to fix.
This method does tell you what to fix.
You can do it quite effectively in GDB.
EDIT in response to @BurntSushi5:
Here are my "gripes with graphs" :)
In the first place, they are super easy to fool.
For example, suppose A1 spends all its time calling C2, and A2 spends all its time calling C1.
Then suppose a new routine B is inserted, such that when A1 calls B, B calls C2, and when A2 calls B, B calls C1.
The graph loses the information that every time C2 is called, A1 is above it on the stack, and vice-versa.
For another example, suppose every call to C is from A.
Then suppose instead A "dispatches" to a bunch of functions B1, B2, ..., each of which calls C.
The graph loses the information that every call to C comes through A.
Now to the graph that was linked:
It places great emphasis on self time, making giant boxes, when inclusive time is far more important. (In fact, the whole reason gprof was invented was because self time was about as useful as a clock with only a second hand.) They could at least have scaled the boxes by inclusive time.
It says nothing about the lines of code that the calls come from, or that are spending the self time. It's based on the assumption that all functions should be small. Maybe that's true, and maybe not, but is it a good enough reason for the profile output to be unhelpful?
It is chock-full of little boxes that don't matter because their time is insignificant. All they do is take up gobs of real estate and distract you.
There's nothing in there about I/O. The profiler from which the graph came apparently embodies the assumption that the only I/O is necessary I/O, so there's no need to profile it (even if it takes 90% of the time). In big programs, it's really easy for I/O to be done that isn't really necessary, taking a big fraction of time, and so-called "CPU profilers" have the prejudice that it doesn't even exist.
There doesn't seem to be any instance of recursion in that graph, but recursion is common, and useful, and graphs have difficulty displaying it with meaningful measurements.
Just pointing out that, if a small number of stack samples are taken, roughly half of them would look like this:
blah-blah-couldn't-read-it
blah-blah-couldn't-read-it
blah-blah-couldn't-read-it
fragbag.(*structureAtoms).BestStructureFragment
structure.RMSDMem
... a couple other routines
The other half of the samples are doing something else, equally informative.
Since each stack sample shows you the lines of code where the calls come from, you're actually being told why the time is being spent.
(Activities that don't take much time have very small likelihood of being sampled, which is good, because you don't care about those.)
Now I don't know this code, but the graph gives me a strong suspicion that, like a lot of code I see, the devil's in the data structure.

Related

My Algorithm only fails for large values - How do I debug this?

I'm working on transcribing as3delaunay to Objective-C. For the most part, the entire algorithm works and creates graphs exactly as they should be. However, for large values (thousands of points), the algorithm mostly works, but creates some incorrect graphs.
I've been going back through and checking the most obvious places for error, and I haven't been able to actually find anything. For smaller values I ran the output of the original algorithm and placed it into JSON files. I then read that output into my own tests (tests with 3 or 4 points only) and debugged until the output matched; I checked the output of the two algorithms line by line and found the discrepancies. But I can't feasibly do that for 1000 points.
Answers don't need to be specific to my situation (although suggesting tools I can use would be excellent).
How can I debug algorithms that only fail for large values?
If you are transcribing an existing algorithm to Objective-C, do you have a working original in some other language? In that case, I would be inclined to put in print statements in both versions and debug the first discrepancy (the first, because later discrepancies could be knock-on errors).
I think it is very likely that the program also makes mistakes for smaller graphs, but more rarely. My first step would in fact be to use the working original (or some other means) to run a large number of automatically checked test runs on small graphs, hoping to find the bug on some more manageable input size.
Find the threshold
If it works for 3 or 4 items, but not for 1000, then there's probably some threshold in between. Use a binary search to find that threshold.
The threshold itself may be a clue. For example, maybe it corresponds to a magic value in the algorithm or to some other value you wouldn't expect to be correlated. For example, perhaps it's a problem when the number of items exceeds the number of pixels in the x direction of the chart you're trying to draw. The clue might be enough to help you solve the problem. If not, it may give you a clue as to how to force the problem to happen with a smaller value (e.g., debug it with a very narrow chart area).
The threshold may be smaller than you think, and may be directly debuggable.
If the threshold is a big value, like 1000, perhaps you can set a conditional breakpoint to skip right to iteration 999 and then single-step from there.
There may not be a definite threshold, which suggests that it's not the magnitude of the input size, but some other property you should be looking at (e.g., powers of 10 don't work, but everything else does).
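If you can compare against the reference implementation automatically, the binary search over input sizes can itself be automated. A minimal sketch in Go, assuming a hypothetical worksCorrectly(n) check that generates an input of size n and compares your output with the reference output:

// findThreshold returns the smallest input size for which the check fails,
// assuming the check passes at lo and fails at hi (and that failure is
// monotone in between, which is exactly the "threshold" hypothesis).
func findThreshold(lo, hi int, worksCorrectly func(n int) bool) int {
    for lo+1 < hi {
        mid := lo + (hi-lo)/2
        if worksCorrectly(mid) {
            lo = mid // still correct at mid, so the threshold is above it
        } else {
            hi = mid // already failing at mid, so the threshold is at or below it
        }
    }
    return hi
}

For example, findThreshold(4, 1000, check) narrows the range 4..1000 down to the first failing size in about ten checks.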
Decompose the problem and write unit tests
This can be tedious but is often extremely valuable--not just for the current issue, but for the future. Convince yourself that each individual piece works in isolation.
Re-visit recent changes
If it used to work and now it doesn't, look at the most recent changes first. Source control tools are very useful in helping you remember what has changed recently.
Remove code and add it back piece by piece
Comment out as much code as you can and still get some kind of reasonable output (even if that output doesn't meet all the requirements). For example, instead of using a complicated rounding function, just truncate values. Comment out code that adds decorative touches. Put assert(false) in any special case handlers you don't think should be activated for the test data.
Now verify that output, and slowly add back the functionality you removed, one baby step at a time. Test thoroughly at each step.
Profile the code
Profiling is usually for optimization, but it can sometimes give you insight into code, especially when the data size is too large for single-stepping through the debugger. I like to use line or statement counts. Is the loop body executing the number of times you expect? Or twice as often? Or not at all? How about the then and else clauses of those if statements? Logic bugs often become very obvious with this type of profiling.
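As a tiny illustration of that kind of counting (the names here are made up), you can bolt counters onto the suspect loop and its branches and print them at the end:

import "log"

var loopIterations, thenTaken, elseTaken int

// processAll is a stand-in for the routine under suspicion; the counters make
// it obvious whether the loop and each branch run as often as you expect.
func processAll(items []int) {
    for _, it := range items {
        loopIterations++
        if it%2 == 0 { // stand-in for the real condition
            thenTaken++
            // ... "then" work ...
        } else {
            elseTaken++
            // ... "else" work ...
        }
    }
    log.Printf("loop=%d then=%d else=%d", loopIterations, thenTaken, elseTaken)
}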

When timing how long a quick process runs, how many runs should be used?

Let's say I am going to run process X and see how long it takes.
I am going to save into a database a date I ran this process, and the time it took. I want to know what to put into the DB.
Process X almost always runs under 1500ms, so this is a short process. It usually runs between 500 and 1500ms, quite a range (3x difference).
My question is, how many "runs" should be saved into the DB as a single run?
Every run saved into the DB as its own row?
5 runs, averaged, then save that time?
10 runs averaged?
20 runs, remove anything more than 2 std deviations away, and save everything inside that range?
Does anyone have any good info backing them up on this?
Save the data for every run into its own row. Then later you can use and analyze the data however you like... i.e., all the other options you listed can be performed after the fact. It's not really possible for someone else to draw meaningful conclusions about how to average/analyze the data without knowing more about what's going on.
The fastest run is the one that most accurately times only your code.
All slower runs are slower because of noise introduced by the operating system scheduler.
The variance you experience is going to differ from machine to machine, and even on identical machines, the set of runnable processes will introduce noise.
None of the above. Bryan is close, though. You should save every measurement. But don't average them. The average (arithmetic mean) can be very misleading in this type of analysis. The reason is that some of your measurements will be much longer than the others. This will happen because things can interfere with your process, even on 'clean' test systems. It can also happen because your process may not be as deterministic as you might think.
Some people think that simply taking more samples (running more iterations) and averaging the measurements will give them better data. It doesn't. The more you run, the more likely it is that you will encounter a perturbing event, thus making the average overly high.
A better way to do this is to run as many measurements as you can (time permitting). 100 is not a bad number, but 30-ish can be enough.
Then, sort these by magnitude and graph them. Note that this is not a normal distribution. Compute some simple statistics: mean, median, min, max, lower quartile, upper quartile.
Contrary to some guidance, do not 'throw away' outside values or 'outliers'. These are often the most interesting measurements. For example, you may establish a nice baseline, then look for departures. Understanding these departures will help you fully understand how your process works, how the system affects your process, and what can interfere with your process. It will often readily expose bugs.
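A minimal sketch of those summary statistics in Go (purely illustrative; the quartiles use the crude nearest-rank convention and at least one sample is assumed):

import "sort"

// summarize reports min, lower quartile, median, upper quartile, max, and mean
// for a slice of run times in milliseconds.
func summarize(runsMs []float64) (min, q1, median, q3, max, mean float64) {
    sorted := append([]float64(nil), runsMs...) // copy, so the caller's order is preserved
    sort.Float64s(sorted)
    n := len(sorted)

    sum := 0.0
    for _, v := range sorted {
        sum += v
    }

    return sorted[0], sorted[n/4], sorted[n/2], sorted[3*n/4], sorted[n-1], sum / float64(n)
}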
Depends what kind of data you want. I'd say one line per run initially, then analyze the data, go from there. Maybe store a min/max/average of X runs if you want to consolidate it.
http://en.wikipedia.org/wiki/Sample_size
Bryan is right - you need to investigate more. If your code has that much variance even "most" of the time, then you might have a lot of fluctuation in your test environment because of other processes, OS paging, or other factors. If not, it seems that you have code paths doing wildly varying amounts of work, and coming up with a single number per run to describe the performance of such a multi-modal system is not going to tell you much. So I'd say isolate your setup as much as possible, run at least 30 trials, and get a feel for what your performance curve looks like. Once you have that, you can use that Wikipedia page to come up with a number that will tell you how many trials you need to run per code change to see if the performance has increased/decreased with some level of statistical significance.
While saying, "Save every run," is nice, it might not be practical in your case. However, I do think that storing only the average eliminates too much data. I like storing the average of ten runs, but instead of storing just the average, I'd also store the max and min values, so that I can get a feel for the spread of the data in addition to its center.
The max and min information in particular will tell you how often corner cases arise. Is the 1500ms case a one-in-1000 outlier? Or is it something that recurs on a regular basis?

Chess Optimizations

OK, so I have been working on my chess program for a while and I am beginning to hit a wall. I have done all of the standard optimizations (negascout, iterative deepening, killer moves, history heuristic, quiescence search, pawn position evaluation, some search extensions) and I'm all out of ideas!
I am looking to make it multi-threaded soon, and that should give me a good boost in performance, but aside from that are there any other nifty tricks you guys have come across? I have considered switching to MTD(f), but I have heard it is a hassle and isn't really worth it.
What I would be most interested in is some kind of learning algorithm, but I don't know if anyone has done that effectively with a chess program yet.
Also, would switching to a bitboard representation be significant? I am currently using 0x88.
Over the last year of development of my chess engine (www.chessbin.com), much of the time has been spent optimizing my code to allow for better and faster move searching. Over that time I have learned a few tricks that I would like to share with you.
Measuring Performance
Essentially you can improve your performance in two ways:
Evaluate your nodes faster
Search fewer nodes to come up with
the same answer
Your first problem in code optimization will be measurement. How do you know you have really made a difference? In order to help you with this problem you will need to make sure you can record some statistics during your move search. The ones I capture in my chess engine are:
Time it took for the search to complete
Number of nodes searched
This will allow you to benchmark and test your changes. The best way to approach testing is to create several save games from the opening position, middle game and the end game. Record the time and number of nodes searched for black and white.
After making any changes I usually perform tests against the above-mentioned save games to see if I have made improvements in the above two metrics: number of nodes searched or speed.
To complicate things further, after making a code change you might run your engine 3 times and get 3 different results each time. Let's say that your chess engine found the best move in 9, 10 and 11 seconds. That is a spread of about 20%. So did you improve your engine by 10%-20%, or was it just varying load on your PC? How do you know? To fight this I have added methods that allow my engine to play against itself, making moves for both white and black. This way you can test not just the time variance over one move, but a series of as many as 50 moves over the course of the game. If last time the game took 10 minutes and now it takes 9, you probably improved your engine by 10%. Running the test again should confirm this.
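Whatever language the engine is in, the bookkeeping is the same; here is a small sketch in Go (the names and the shape of the search callback are made up): record the elapsed time and a node count per saved position so two builds can be compared on identical inputs.

import "time"

// SearchStats holds the two numbers worth comparing between builds.
type SearchStats struct {
    Nodes   int64
    Elapsed time.Duration
}

// benchmarkPosition times one search of a saved position; the search callback
// is expected to return how many nodes it visited.
func benchmarkPosition(position string, search func(pos string) (nodes int64)) SearchStats {
    start := time.Now()
    n := search(position)
    return SearchStats{Nodes: n, Elapsed: time.Since(start)}
}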
Finding Performance Gains
Now that we know how to measure performance gains, let's discuss how to identify potential performance gains.
If you are in a .NET environment then the .NET profiler will be your friend. If you have a Visual Studio for Developers edition, it comes built in for free; however, there are other third-party tools you can use. This tool has saved me hours of work, as it will tell you where your engine is spending most of its time and allow you to concentrate on your trouble spots. If you do not have a profiler tool you may have to somehow log time stamps as your engine goes through different steps. I do not suggest this. In this case a good profiler is worth its weight in gold. Red Gate ANTS Profiler is expensive but the best one I have ever tried. If you can't afford one, at least use its 14-day trial.
Your profiler will surely identify things for you; however, here are some small lessons I have learned working with C#:
Make everything private
Whatever you can't make private, make it sealed
Make as many methods static as possible
Don't make your methods chatty; one long method is better than 4 smaller ones
A chess board stored as an [8][8] array is slower than an array of [64]
Replace int with byte where possible
Return from your methods as early as possible
Stacks are better than lists
Arrays are better than stacks and lists
If you can, define the size of the list before you populate it
Casting, boxing and un-boxing are evil
Further Performance Gains:
I find move generation and ordering is extremely important. However here is the problem as I see it. If you evaluate the score of each move before you sort and run Alpha Beta, you will be able to optimize your move ordering such that you will get extremely quick Alpha Beta cutoffs. This is because you will be able to mostly try the best move first.
However the time you have spent evaluating each move will be wasted. For example you might have evaluated the score on 20 moves, sort your moves try the first 2 and received a cut-off on move number 2. In theory the time you have spent on the other 18 moves was wasted.
On the other hand if you do a lighter and much faster evaluation say just captures, your sort will not be that good and you will have to search more nodes (up to 60% more). On the other hand you would not do a heavy evaluation on every possible move. As a whole this approach is usually faster.
Finding this perfect balance between having enough information for a good sort and not doing extra work on moves you will not use will allow you to find huge gains in your search algorithm. Furthermore, if you choose the poorer sort approach, you will want to first do a shallower search, say to ply 3, and sort your moves before you go into the deeper search (this is often called iterative deepening). This will significantly improve your sort and allow you to search far fewer moves.
Answering an old question.
Assuming you already have a working transposition table.
Late Move Reduction. That gave my program about 100 elo points and it is very simple to implement.
In my experience, unless your implementation is very inefficient, then the actual board representation (0x88, bitboard, etc.) is not that important.
Although you can cripple your chess engine with bad performance, a lightning-fast move generator in itself is not going to make a program good.
The search tricks used and the evaluation function are the overwhelming factors determining overall strength.
And the most important parts, by far, of the evaluation are Material, Passed pawns, King Safety and Pawn Structure.
The most important parts of the search are: Null Move Pruning, Check Extension and Late Move reduction.
Your program can come a long, long way, on these simple techniques alone!
Good move ordering!
An old question, but the same techniques apply now as they did 5 years ago. Aren't we all writing our own chess engines? I have my own called "Norwegian Gambit", which I hope will eventually compete with other Java engines on the CCRL. Like many others, I use Stockfish for ideas, since it is so nicely written and open. Their testing framework Fishtest and its community also give a ton of good advice. It is worth comparing your evaluation scores with what Stockfish gets, since how to evaluate is probably still the biggest unknown in chess programming, and Stockfish has moved away from many traditional evaluation terms that have become urban legends (like the double-bishop bonus). The biggest difference, however, came after I implemented the same techniques you mention (Negascout, TT, LMR): I started using Stockfish for comparison, and I noticed that for the same depth Stockfish searched far fewer moves than I did (because of its move ordering).
Move ordering essentials
The one thing that is easily forgotten is good move-ordering. For the Alpha Beta cutoff to be efficient it is essential to get the best moves first. On the other hand it can also be time-consuming so it is essential to do it only as necessary.
Transposition table
Sort promotions and good captures by their gain
Killer moves
Moves that result in check on opponent
History heuristics
Silent moves - sort by PSQT value
The sorting should be done as needed, usually it is enough to sort the captures, and thereafter you could run the more expensive sorting of checks and PSQT only if needed.
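As a rough sketch of that ordering (in Go for concreteness; the score bands and fields are illustrative, not taken from any particular engine), the idea is simply to give every generated move a score and search in descending order:

import "sort"

// Move is a minimal illustrative move record with just the fields the ordering needs.
type Move struct {
    FromTT      bool // best move stored in the transposition table
    IsCapture   bool
    IsPromotion bool
    Victim      int // value of the captured (or promoted-to) piece, e.g. pawn=100 ... queen=900
    Attacker    int // value of the moving piece
    IsKiller    bool
    GivesCheck  bool
    History     int // history-heuristic counter
    PSQTGain    int // piece-square-table delta, used for quiet moves
}

// score ranks moves so the likely-best ones are tried first.
func score(m Move) int {
    switch {
    case m.FromTT:
        return 1000000 // always try the hash move first
    case m.IsCapture || m.IsPromotion:
        return 100000 + m.Victim*10 - m.Attacker // crude "sort by gain" ordering (MVV-LVA style)
    case m.IsKiller:
        return 90000
    case m.GivesCheck:
        return 80000
    default:
        return m.History + m.PSQTGain // quiet moves last, by history then PSQT
    }
}

// orderMoves sorts in place, best score first.
func orderMoves(moves []Move) {
    sort.Slice(moves, func(i, j int) bool { return score(moves[i]) > score(moves[j]) })
}

As the answer notes, in practice you would only pay for the cheap capture scoring up front and defer the check/PSQT scoring until the cheap moves have been exhausted.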
About Java/C# vs C/C++/Assembly
Programming techniques are the same for Java as in the excellent answer by Adam Berent, who used C#. In addition to his list I would mention avoiding object arrays; rather, use many arrays of primitives. But contrary to his suggestion of using bytes, I find that with 64-bit Java there's little to be saved by using byte and int instead of 64-bit long. I have also gone down the path of rewriting in C/C++/assembly, and I got no performance gain whatsoever. I used assembly code for bit-scan instructions such as LZCNT and POPCNT, but later I found that Java 8 also uses those instead of the methods on the Long object. To my surprise, Java is faster; the Java 8 virtual machine seems to do a better job of optimizing than a C compiler can.
I know that one improvement that was talked about in AI courses at university was having a huge database of finishing moves, i.e. a precalculated database for positions with only a small number of pieces left. If you hit a near-endgame position in your search, you stop the search and take a precalculated value. That improves your search results, like extra deepening that you can do for important/critical moves without much computation time spent. I think it also comes with a change in heuristics in the late game state, but I'm not a chess player, so I don't know the dynamics of game finishing.
Be warned, getting game search right in a threaded environment can be a royal pain (I've tried it). It can be done, but from some literature searching I did a while back, it's extremely hard to get any speed boost at all out of it.
It's quite an old question; I was just searching questions on chess and found this one unanswered. Well, it may not be of any help to you now, but may prove helpful to other users.
I didn't see null move pruning or transposition tables mentioned... are you using them? They would give you a big boost...
One thing that gave me a big boost was minimizing conditional branching... A lot of things can be precomputed. Search for such opportunities.
Most modern PCs have multiple cores, so it would be a good idea to make it multithreaded. You don't necessarily need to go MTD(f) for that.
I won't suggest moving your code to bitboards. It's simply too much work, even though bitboards could give a boost on 64-bit machines.
Finally, and most importantly, the chess literature dominates any optimizations we may use; optimization is too much work. Look at open-source chess engines, particularly Crafty and Fruit/Toga. Fruit used to be open source initially.
Late answer, but this may help someone:
Given all the optimizations you mentioned, 1450 ELO is very low. My guess is that something is very wrong with your code. Did you:
Write a perft routine and run it through a set of positions? All tests should pass, so you know your move generator is free of bugs. If you don't have this, there's no point in talking about ELO (see the perft sketch at the end of this answer).
Write a mirrorBoard routine and run the evaluation code through a set of positions? The result should be the same for the normal and mirrored positions; otherwise, you have a bug in your eval.
Do you have a hashtable (aka transposition table)? If not, this is a must. It will help while searching and ordering moves, giving a brutal difference in speed.
How do you implement move ordering? This links back to point 3.
Did you implement the UCI protocol? Is your move parsing function working properly? I had a bug like this in my engine:
/* Parses a UCI move string and returns a Board object */
Board parseUCIMoves(String moves) { // e.g. "e2e4 c7c5 g1f3 ..."
    // ...
    if (someMove.equals("e1g1") || someMove.equals("e1c1")) {
        // apply castling -- note this treats ANY move written "e1g1"/"e1c1"
        // as castling, even when the piece leaving e1 is not the king
    }
    // ...
}
Sometimes the engine crashed while playing a match, and I thought it was the GUI's fault, since all perft tests were OK. It took me one week to find the bug, by luck. So, test everything.
For (1) you can search every position to depth 6. I use a file with ~1000 positions. See here https://chessprogramming.wikispaces.com/Perft
For (2) you just need a file with millions of positions (just the FEN string).
Given all the above and a very basic evaluation function (material, piece-square tables, passed pawns, king safety), it should play at around 2000 ELO.
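For reference, perft itself is tiny; here is a sketch in Go, assuming hypothetical Board, Move, GenerateLegalMoves and MakeMove helpers that your own engine would supply:

// perft counts the leaf nodes reachable from b in exactly depth plies.
// Comparing the counts against published perft tables catches nearly every
// move-generator bug (castling rights, en passant, promotions, pins).
func perft(b Board, depth int) uint64 {
    if depth == 0 {
        return 1
    }
    var nodes uint64
    for _, m := range GenerateLegalMoves(b) {
        nodes += perft(MakeMove(b, m), depth-1)
    }
    return nodes
}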
As far as tips, I know large gains can be found in optimizing your move generation routines before any eval functions. Making that function as tight as possible can give you 10% or more in nodes/sec improvement.
If you're moving to bitboards, do some digging in the rec.games.chess.computer archives for some of Dr. Robert Hyatt's old posts about Crafty (I'm pretty sure he doesn't post anymore). Or grab the latest copy from his FTP site and start digging. I'm pretty sure it would be a significant shift for you, though.
Transposition Table
Opening Book
End Game Table Bases
Improved Static Board Evaluation for Leaf Nodes
Bitboards for Raw Speed
Profile and benchmark. Theoretical optimizations are great, but unless you are measuring the performance impact of every change you make, you won't know whether your work is improving or worsening the speed of the final code.
Try to limit the penalty to yourself for trying different algorithms. Make it easy to test various implementations of algorithms against one another. i.e. Make it easy to build a PVS version of your code as well as a NegaScout version.
Find the hot spots. Refactor. Rewrite in assembly if necessary. Repeat.
Assuming "history heuristic" involves some sort of database of past moves, a learning algorithm isn't going to give you much more unless it plays a lot of games against the same player. You can probably achieve more by classifying a player and tweaking the selection of moves from your historic database.
It's been a long time since I've done any programming on any chess program, but at the time, bitboards did give a real improvement. Other than that I can't give you much advice. Do you only evaluate the position of pawns? Some (slight) bonuses for the position or mobility of some key pieces may be in order.
I'm not certain what type of thing you would like it to learn however...

critical path analysis

I'm trying to write a VB6 program (for a laugh) that will compute event times plus the critical path JUST BASED ON A PRECEDENCE TABLE. I want my students to use it as a checking mechanism, i.e. to do everything without drawing the activity network. I'm happy that I can do all this once I've got start and finish events for each activity. How do I allocate events without drawing the network? Everything I come up with works for a specific example and then doesn't work for another one. I need a more general algorithm, and it's driving me mental. Help!
I am not a professional programmer - I do this in my spare time to create teaching resources - simple English would really be appreciated.
Okay, so you have a precedence table, which I take to be a table of pairs like
A→B
B→C
and so forth, for activities {A,B,C}. Each of the activities also has a duration and (maybe) a distribution on the duration, so you know A takes 3 days, B takes 2, and so on. This would be interpreted as "A must be finished before B which must be finished before C".
Right?
Now, the obvious thing to do is construct the graph of activities and arrows -- in fact, you basically have the graph there in incidence-list form. The critical part is the greatest-weight (biggest sum of times) path. This is a longest-path problem, and assuming your chart isn't cyclic (which would be bad anyway) it can be solved with topological sort or transitive closure.
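The asker is working in VB6, but the shape of the computation is the same in any language. Here is a sketch in Go of the topological-sort (Kahn's algorithm) approach, where the precedence table is a map from each activity to the activities that must come after it, plus a duration per activity (the numbers in main are made up):

package main

import "fmt"

// criticalPathLength returns the length of the longest (critical) path:
// duration maps each activity to its time, and after maps each activity to
// the activities that may only start once it has finished.
func criticalPathLength(duration map[string]float64, after map[string][]string) float64 {
    // Count incoming edges for Kahn's topological sort.
    indegree := make(map[string]int)
    for a := range duration {
        indegree[a] = 0
    }
    for _, succs := range after {
        for _, s := range succs {
            indegree[s]++
        }
    }

    // earliestFinish[a] = duration[a] + latest earliestFinish among a's predecessors.
    earliestFinish := make(map[string]float64)
    queue := []string{}
    for a, d := range indegree {
        if d == 0 {
            queue = append(queue, a)
            earliestFinish[a] = duration[a]
        }
    }

    for len(queue) > 0 {
        a := queue[0]
        queue = queue[1:]
        for _, s := range after[a] {
            if f := earliestFinish[a] + duration[s]; f > earliestFinish[s] {
                earliestFinish[s] = f
            }
            indegree[s]--
            if indegree[s] == 0 {
                queue = append(queue, s)
            }
        }
    }

    best := 0.0
    for _, f := range earliestFinish {
        if f > best {
            best = f
        }
    }
    return best
}

func main() {
    duration := map[string]float64{"A": 3, "B": 2, "C": 4}
    after := map[string][]string{"A": {"B"}, "B": {"C"}}
    fmt.Println(criticalPathLength(duration, after)) // 9: A, then B, then C
}

The same forward pass also gives each activity its earliest start/finish event times (earliestFinish minus duration), which is what you need to label events without drawing the network; a mirror-image backward pass gives the latest times and hence the float.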

How to detect anomalous resource consumption reliably?

This question is about a whole class of similar problems, but I'll ask it as a concrete example.
I have a server with a file system whose contents fluctuate. I need to monitor the available space on this file system to ensure that it doesn't fill up. For the sake of argument, let's suppose that if it fills up, the server goes down.
It doesn't really matter what it is -- it might, for example, be a queue of "work".
During "normal" operation, the available space varies within "normal" limits, but there may be pathologies:
Some other (possibly external) component that adds work may run out of control
Some component that removes work seizes up, but remains undetected
The statistical characteristics of the process are basically unknown.
What I'm looking for is an algorithm that takes, as input, timed periodic measurements of the available space (alternative suggestions for input are welcome), and produces as output, an alarm when things are "abnormal" and the file system is "likely to fill up". It is obviously important to avoid false negatives, but almost as important to avoid false positives, to avoid numbing the brain of the sysadmin who gets the alarm.
I appreciate that there are alternative solutions like throwing more storage space at the underlying problem, but I have actually experienced instances where 1000 times wasn't enough.
Algorithms which consider stored historical measurements are fine, although on-the-fly algorithms which minimise the amount of historic data are preferred.
I have accepted Frank's answer, and am now going back to the drawing-board to study his references in depth.
There are three cases, I think, of interest, not in order:
The "Harrods' Sale has just started" scenario: a peak of activity that at one-second resolution is "off the dial", but doesn't represent a real danger of resource depletion;
The "Global Warming" scenario: needing to plan for (relatively) stable growth; and
The "Google is sending me an unsolicited copy of The Index" scenario: this will deplete all my resources in relatively short order unless I do something to stop it.
It's the last one that's (I think) most interesting, and challenging, from a sysadmin's point of view.
If it is actually related to a queue of work, then queueing theory may be the best route to an answer.
For the general case you could perhaps attempt a (multiple?) linear regression on the historical data, to detect if there is a statistically significant rising trend in the resource usage that is likely to lead to problems if it continues (you may also be able to predict how long it must continue to lead to problems with this technique - just set a threshold for 'problem' and use the slope of the trend to determine how long it will take). You would have to play around with this and with the variables you collect though, to see if there is any statistically significant relationship that you can discover in the first place.
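A minimal sketch of that in Go (illustrative only; the statistical-significance part is omitted, and a real implementation would also look at the standard error of the slope before raising an alarm):

// fitLine returns slope and intercept of the least-squares line y = a*x + b.
func fitLine(x, y []float64) (a, b float64) {
    n := float64(len(x))
    var sx, sy, sxx, sxy float64
    for i := range x {
        sx += x[i]
        sy += y[i]
        sxx += x[i] * x[i]
        sxy += x[i] * y[i]
    }
    a = (n*sxy - sx*sy) / (n*sxx - sx*sx)
    b = (sy - a*sx) / n
    return
}

// hoursUntil estimates how long until free space drops below threshold, given
// samples of (hoursSinceStart, freeBytes). ok is false when the fitted trend
// is flat or rising, i.e. no depletion is predicted.
func hoursUntil(hours, freeBytes []float64, threshold float64) (h float64, ok bool) {
    a, b := fitLine(hours, freeBytes)
    if a >= 0 {
        return 0, false
    }
    now := hours[len(hours)-1]
    current := a*now + b
    return (current - threshold) / -a, true
}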
Although it covers a completely different topic (global warming), I've found tamino's blog (tamino.wordpress.com) to be a very good resource on statistical analysis of data that is full of knowns and unknowns. For example, see this post.
edit: as per my comment I think the problem is somewhat analogous to the GW problem. You have short term bursts of activity which average out to zero, and long term trends superimposed that you are interested in. Also there is probably more than one long term trend, and it changes from time to time. Tamino describes a technique which may be suitable for this, but unfortunately I cannot find the post I'm thinking of. It involves sliding regressions along the data (imagine multiple lines fitted to noisy data), and letting the data pick the inflection points. If you could do this then you could perhaps identify a significant change in the trend. Unfortunately it may only be identifiable after the fact, as you may need to accumulate a lot of data to get significance. But it might still be in time to head off resource depletion. At least it may give you a robust way to determine what kind of safety margin and resources in reserve you need in future.
