Weather prediction algorithm variety - algorithm

Currently there's a big 'storm' over the predictions by the MetOffice in the UK. They predicted a mild, wet winter, while we have the coldest temperature on record in Northern Ireland and solid snow on the ground, normally rare in December.
It's something I'd love to have a play with, not that I'm claiming I can beat them, but was wondering what algorithms are out there currently that people are working with? What datasets do they base it on?
Possibilities presumably include neural networks modelling input with fitness being the accuracy of the prediction, complex mathematical models, or even the 'same as yesterday' prediction which I've heard claim (although not seen evidence) that it's more reliable for single-day prediction (although obviously drops off after that).
Ideally like to hear from some developers in weather centres or who get access to the supercomputers, it'd be interesting to hear approaches...

In short, if you intend to build and run your own forecasting model, you will face three major problems:
Access to observations
Development of a mathematical model
Computational power to run your model
Access to observation
As far as I know, access to good meteorological observations costs a lot of money.
You need to have observations from all over the globe and model the state of oceans and atmosphere for the whole planet. Alternatively, you need to obtain so-called lateral boundary conditions from someone who calculates a global model.
Development of a mathematical model
I'm not and I've never been affiliated with Met Office, but I used to port and optimize a version of their Unified Model to a supercomputer at our center a couple of years ago. Here's how I remember the model.
Met Office has been developing their Unified Model for the last 20+ years, we're talking about millions of lines of code that contain state of the art ocean/atmospheric models and numerical algorithms. Check out this section of (outdated) User Guide for a glimpse of scientific methods used in their model. It's a fruit of, give or take, half a century of well-funded, extensive research by a large community of smart people. If there was a simple solution that would consistently give better results than the complex models, someone would've probably implemented it by now.
To conclude, I guess it's very hard to get even remotely satisfactory results in weather forecasting by building a model from scratch, unless you're a MSc/PhD in atmospheric physics and you've got a couple of years of free time on your hands.
Computational power to run your model
The first forecasting models were run in the middle of 20th century on machines that cannot match with today's cellphones, so, technically, you could calculate something on your PC. However, this type of job is often done on very, very powerful machines. In fact, 10 systems in the Top500 are dedicated solely to weather forecasting and climate research.
Interesting reads
http://en.wikipedia.org/wiki/Weather_forecasting#How_models_create_forecasts
http://en.wikipedia.org/wiki/Numerical_weather_prediction
http://research.metoffice.gov.uk/research/nwp/numerical/operational/index.html
http://ncas-cms.nerc.ac.uk/html_umdocs/UM55_User_Guide/
UPDATE It's possible to obtain the source code of the WRF model for free, together with some met data. Note that WRF, Unified Model, COAMPS, and many other models are written primarily in Fortran.

First off, you can import raw data from http://tgftp.nws.noaa.gov and other weather data. The best way for the computer to understand the data is putting it on a map. Each point on the map reacts with each other. Data at each point can represent Temp, Pressure, Wind and Direction, Cloud Coverage, Where sun is in the sky, Visibility, last 100hrs of precipitation. You could make predictions, then compare them later to the actual predictions as well as the Weather Service's predictions. Then update a climate model for that data point. That way, it could be a self learning neural network. As far as computation power is concerned, Get a Titan, Big Mac!

It seems to be possible to construct simple forecast model. My watch features a barometer and a thermometer (which is not usable at all, because the watch is warmed by the hand). Solely on those measurements, it has several times warned me of incoming rain, in spite of sunny forecasts from internet sites. (the cloud picture at upper left corner)
A quick search leads us to the Sager Algorithm, which uses only very simple input data. However, while the implementation claims to be open-source, I have failed to locate both the code and scientific papers on the algorithm.

Related

What Machine Learning algorithm would be appropriate?

I am working on a predictor for learning the most likely period for grape harvesting, depending on weather and on the characteristics of grape, namely sugar level, Ph, acidity. I've got two datasets and I am thinking of how to merge them together: one is the pre-harvest analysis data of some Italian vineyards in the 2003-2013 period, the other is the weather on that decade. What I want to do is learning from my samples when to harvest, given a range for the optimal sugar level, Ph and acidity, and given a weather forecast.
I thought that some Reinforcement Learning approach could work. Since the pre-harvest analysis are done about 5 times during the grape maturation period, I thought that those could be states I step in, while the weather conditions could be the "probabilities" of going from a state to another.
Yet I am not sure of what algorithm would be the best as every state and every "probability" depends on several variables. I was told that Hidden Markov Model would work, but it seems to me that my problem doesn't fit the model perfectly.
Do you have any suggestion? Thx in advance
This has nothing to do with the actual algorithm, but the problem you are going to run into here is that weather is extremely local. One vineyard can have completely different weather than another only a mile away from it, believe or not. If you put rain gauges at each vineyard, you will find this out. To get really good results you need to have a mini weather station at each vineyard. Absent this, your best option is to use only vineyards in the immediate vicinity of the weather measurements. For example, if your data is from an airport, only use vineyards right next to the airport.
Reinforcement learning is appropriate when you can control the action. It is like a monkey pushing buttons. You push a button and get shocked, so you don't push that button again. Here you have a passive data set and cannot conduct experimental actions, so reinforcement learning does not apply.
Here you have a complex set of uncontrolled inputs, the weather data, a controlled input (harvest time), and several output parameters, sugar etc. Given that data, you want to predict what harvest time to use for some future, unknown weather pattern.
In general, what you are doing is sensitivity analysis: trying to figure out how your factors affected the outcome that occurred. The tricky part is that the outcomes may be driven by some non-obvious pattern. For example, maybe 3 weeks of drought, followed by 2 weeks of heavy rain implies the best harvest will be 65 days hence, or something like that.
So, what you have to do is featurize the data to characterize it in possible likely ways, then do a sensitivity analysis. If the analysis has a strong correlation, then you have found a solution. If it does not, then you have to find a different way to featurize the data. For example, your featurization might be number of days with rain over 2 inches, or it might be most number of days without rain, or it might be total number of days with bright sunshine. Possibly multiple features might combine to make a solution. The options are limited only by your imagination.
Of course, as I was saying above, the fly in the ointment is that your weather data will only roughly approximate the real and actual weather at the particular vineyard, so there will be noise in the data, possibly so much noise as to make getting a good result impossible.
Why you actually don't care too much about the weather
Getting back to the data, having unreliable weather information is actually not a problem, because you actually don't care too much about the weather. The reason is two-fold. First of all, the question you are trying to answer is not when to harvest the grapes, it is whether to wait to harvest or not. The vintner can always measure the current sugar of the grapes. So, he just has to decide, "Should I harvest the grapes now with sugar X%, or should I wait and possibly get a better sugar Z% later? To answer this question the real data you need is not the weather, it is a series of sugar/acidity readings taken over time. What you want to predict is whether, given a situation, the grapes will get better or whether they will get worse.
Secondly, grapevines have an optimal amount of moisture they like. If the vine gets too dry, that is bad, if it gets too wet that is bad. You cannot predict how moist a vine is from the weather. Some soils hold moisture well, others are sandy. A sandy vineyard will require more rain than a clay vineyard to have the same moisture levels. Also, the vintner can water his vineyards, completely invalidating the rainfall pattern. Therefore, weather is pretty much a non-factor.
I agree with Tyler that from a feasible standpoint weather might harm your analysis. However, I think this is for you to test and find out!- there could be some interesting data that comes out of it.
I'm not sure exactly what your test is, but a simple way to start perhaps is to make this into a classification problem using svm (or even logistic regression since you want probabilities) and use all the data as the input for the algorithm- assuming you know which years were good harvest years or not. You could even test each variable individually and see how it effects your performance. I suggest you go this way if you can just because there's massive amounts of sources on the net and people here on SO that can help you tune your algo.
When you have a handle on this, I would, as you seem to have been suggested before, try the HMM- as it will tell you which day was probably the best for the harvest. This is where the weather might hurt, but you'll come to understand more about your data from the simpler experiments.
The think I've learned about machine learning is that while there are guidelines for when to choose which algorithm its not always set in stone and you can change your question slightly and try a new approach to the problem, depending how much freedom you have to play with the data. Good luck and have fun!

Algorithms for realtime strategy wargame AI

I'm designing a realtime strategy wargame where the AI will be responsible for controlling a large number of units (possibly 1000+) on a large hexagonal map.
A unit has a number of action points which can be expended on movement, attacking enemy units or various special actions (e.g. building new units). For example, a tank with 5 action points could spend 3 on movement then 2 in firing on an enemy within range. Different units have different costs for different actions etc.
Some additional notes:
The output of the AI is a "command" to any given unit
Action points are allocated at the beginning of a time period, but may be spent at any point within the time period (this is to allow for realtime multiplayer games). Hence "do nothing and save action points for later" is a potentially valid tactic (e.g. a gun turret that cannot move waiting for an enemy to come within firing range)
The game is updating in realtime, but the AI can get a consistent snapshot of the game state at any time (thanks to the game state being one of Clojure's persistent data structures)
I'm not expecting "optimal" behaviour, just something that is not obviously stupid and provides reasonable fun/challenge to play against
What can you recommend in terms of specific algorithms/approaches that would allow for the right balance between efficiency and reasonably intelligent behaviour?
If you read Russell and Norvig, you'll find a wealth of algorithms for every purpose, updated to pretty much today's state of the art. That said, I was amazed at how many different problem classes can be successfully approached with Bayesian algorithms.
However, in your case I think it would be a bad idea for each unit to have its own Petri net or inference engine... there's only so much CPU and memory and time available. Hence, a different approach:
While in some ways perhaps a crackpot, Stephen Wolfram has shown that it's possible to program remarkably complex behavior on a basis of very simple rules. He bravely extrapolates from the Game of Life to quantum physics and the entire universe.
Similarly, a lot of research on small robots is focusing on emergent behavior or swarm intelligence. While classic military strategy and practice are strongly based on hierarchies, I think that an army of completely selfless, fearless fighters (as can be found marching in your computer) could be remarkably effective if operating as self-organizing clusters.
This approach would probably fit a little better with Erlang's or Scala's actor-based concurrency model than with Clojure's STM: I think self-organization and actors would go together extremely well. Still, I could envision running through a list of units at each turn, and having each unit evaluating just a small handful of very simple rules to determine its next action. I'd be very interested to hear if you've tried this approach, and how it went!
EDIT
Something else that was on the back of my mind but that slipped out again while I was writing: I think you can get remarkable results from this approach if you combine it with genetic or evolutionary programming; i.e. let your virtual toy soldiers wage war on each other as you sleep, let them encode their strategies and mix, match and mutate their code for those strategies; and let a refereeing program select the more successful warriors.
I've read about some startling successes achieved with these techniques, with units operating in ways we'd never think of. I have heard of AIs working on these principles having had to be intentionally dumbed down in order not to frustrate human opponents.
First you should aim to make your game turn based at some level for the AI (i.e. you can somehow model it turn based even if it may not be entirely turn based, in RTS you may be able to break discrete intervals of time into turns.) Second, you should determine how much information the AI should work with. That is, if the AI is allowed to cheat and know every move of its opponent (thereby making it stronger) or if it should know less or more. Third, you should define a cost function of a state. The idea being that a higher cost means a worse state for the computer to be in. Fourth you need a move generator, generating all valid states the AI can transition to from a given state (this may be homogeneous [state-independent] or heterogeneous [state-dependent].)
The thing is, the cost function will be greatly influenced by what exactly you define the state to be. The more information you encode in the state the better balanced your AI will be but the more difficult it will be for it to perform, as it will have to search exponentially more for every additional state variable you include (in an exhaustive search.)
If you provide a definition of a state and a cost function your problem transforms to a general problem in AI that can be tackled with any algorithm of your choice.
Here is a summary of what I think would work well:
Evolutionary algorithms may work well if you put enough effort into them, but they will add a layer of complexity that will create room for bugs amongst other things that can go wrong. They will also require extreme amounts of tweaking of the fitness function etc. I don't have much experience working with these but if they are anything like neural networks (which I believe they are since both are heuristics inspired by biological models) you will quickly find they are fickle and far from consistent. Most importantly, I doubt they add any benefits over the option I describe in 3.
With the cost function and state defined it would technically be possible for you to apply gradient decent (with the assumption that the state function is differentiable and the domain of the state variables are continuous) however this would probably yield inferior results, since the biggest weakness of gradient descent is getting stuck in local minima. To give an example, this method would be prone to something like attacking the enemy always as soon as possible because there is a non-zero chance of annihilating them. Clearly, this may not be desirable behaviour for a game, however, gradient decent is a greedy method and doesn't know better.
This option would be my most highest recommended one: simulated annealing. Simulated annealing would (IMHO) have all the benefits of 1. without the added complexity while being much more robust than 2. In essence SA is just a random walk amongst the states. So in addition to the cost and states you will have to define a way to randomly transition between states. SA is also not prone to be stuck in local minima, while producing very good results quite consistently. The only tweaking required with SA would be the cooling schedule--which decides how fast SA will converge. The greatest advantage of SA I find is that it is conceptually simple and produces superior results empirically to most other methods I have tried. Information on SA can be found here with a long list of generic implementations at the bottom.
3b. (Edit Added much later) SA and the techniques I listed above are general AI techniques and not really specialized to AI for games. In general, the more specialized the algorithm the more chance it has at performing better. See No Free Lunch Theorem 2. Another extension of 3 is something called parallel tempering which dramatically improves the performance of SA by helping it avoid local optima. Some of the original papers on parallel tempering are quite dated 3, but others have been updated4.
Regardless of what method you choose in the end, its going to be very important to break your problem down into states and a cost function as I said earlier. As a rule of thumb I would start with 20-50 state variables as your state search space is exponential in the number of these variables.
This question is huge in scope. You are basically asking how to write a strategy game.
There are tons of books and online articles for this stuff. I strongly recommend the Game Programming Wisdom series and AI Game Programming Wisdom series. In particular, Section 6 of the first volume of AI Game Programming Wisdom covers general architecture, Section 7 covers decision-making architectures, and Section 8 covers architectures for specific genres (8.2 does the RTS genre).
It's a huge question, and the other answers have pointed out amazing resources to look into.
I've dealt with this problem in the past and found the simple-behavior-manifests-complexly/emergent behavior approach a bit too unwieldy for human design unless approached genetically/evolutionarily.
I ended up instead using abstracted layers of AI, similar to a way armies work in real life. Units would be grouped with nearby units of the same time into squads, which are grouped with nearby squads to create a mini battalion of sorts. More layers could be use here (group battalions in a region, etc.), but ultimately at the top there is the high-level strategic AI.
Each layer can only issue commands to the layers directly below it. The layer below it will then attempt to execute the command with the resources at hand (ie, the layers below that layer).
An example of a command issued to a single unit is "Go here" and "shoot at this target". Higher level commands issued to higher levels would be "secure this location", which that level would process and issue the appropriate commands to the lower levels.
The highest level master AI is responsible for very board strategic decisions, such as "we need more ____ units", or "we should aim to move towards this location".
The army analogy works here; commanders and lieutenants and chain of command.

Predictive "blood glucose" algorithm?

I'm writing an app that lets a diabetic user enter his/her "blood glucose" readings, and then charts them on a graph over time from left to right. Since the blood readings will be done only several times a day, an algorithm would be handy to:
a) fill in the gaps on the graph between readings (curves would be more realistic than jerky lines) and allow a more accurate "blood glucose level" daily average
b) roughly predict what will happen in the future (if the user eats nothing that will affect his blood levels)
I suck at calculus. I'm hoping someone here knows a library for this stuff? I'm hoping someone knows of an algorithm that has been tailored for this specific problem already (e.g.: where someone has compared it to real data from diabetics)
Disclaimer: I am very aware that any such algorithm would vary wildly depending on the user. I'm just looking to improve on straight angular lines. Regardless of the diabetic, there is a limit to the rate that blood sugars can rise and fall.
I'm using Javascript, but as it's just math, I could port it from C, Java or whatever.
Blood sugar behavior is very complicated. It is affected by
Current blood sugar (complicated by the possible presence of ketones if the patient is hyperglycemic)
recent food out to several hours depending on the type and how much
recent fast acting insulin (with variety and patient dependent reaction profiles between 45 minutes and two hours long. Oh, and delivery mechanism)
long-acting insulin out past 12 hours (again patient and variety dependent)
activity levels
stress levels
illness
basal insulin rate if the patient wears a pump
ad nauseum
Very hard problem. Any heuristic---any heuristic---you chose would be highly misleading. So short answer:
Don't do it.
This comes, in part, from having compared a diabetic's 24-hour continous glucose log with the ~10 finger pricks taken during the same time. I.e. my suggestion is data driven.
Edit: Evidently I didn't make myself clear.
You can't even get close.
Nothing you can do with finger prick data can be remotely reliable.
Connecting the dots with any lines (even straight segments) is just plain wrong. It doesn't reflect reality. Not even a little bit.
I'm an experimental particle physicist. Complicated data sets are what I do. There is a diabetic in my life (did you guess?). This matters to me.
But I've seen the high frequency data logs, side-by-side with a log of the days finger-pricks, exercise, food, and insulin.
If you could get every-fifteen-minutes data, I'd say go ahead and use a spline. It won't be dangerously misleading. But, if you have 6-10 measurements across the day, you know nothing.
Good news: continuous monitoring is coming down in price. It's out of the lab and available with some pumps even now.
For those who aren't familiar with this: compliant diabetic patients do (results of extremely unscientific polling) 4-6+ glucose tests a day as a matter of course, and several additional ones in the 1-2 hours following any unexpected excursion (they get physical symptoms that allow them to detect severe excursions).
This serves to give the patient a rough idea of how they are doing at controlling their glucose levels, but they also go to a lab to get a Hemoglobin A1C drawn every quarter (or so). The A1C result is dependent mostly on their average blood glucose.
I've talked to people who clocked in a 80-110 (quite favorable numbers) four times a day for months, and got back an A1C suggesting an average above 150 (not desirable at all). Presumable they were going high in the night. And I've heard similar stories from people who we probably going low---very low---in their sleep.
The lesson is:
Finger prick readings have their place, but don't try to extrapolate them to times not well sampled.
If you want to do just a straight fit of the data to make things easier to view then something like what Charlie Martin recommended would likely work well. However, as noted by dmckee this data really wouldn't mean anything.
What you are trying to do is actually more in line with pharmacokenetics which is an entire scientific study in and of itself. In this case I'm not even sure it would entirely apply except in the case of Type I Diabetes as most of what I know about pharamcokenetics only applies drug studies, but if something is being produced by the body then you are likely looking at entirely different types of analysis. If you are interested in the subject then there are quite a few book previews on Google Books if you do a search for "pharmacokienetics" but due to the nature of the subject they are very math heavy and assume that you have an understanding of chemistry and biology as well.
okay, you're going to be looking for some fitted curve. The thing with that is that for n points there are fit polynomials up to order... n-1 I think. It's been a while. Yep. by golly, I'm right. The common thing when you have lots of points and don't wants a complicated function (which you don't) is to use a least-squares approximation.
probably the best thing is to look for a canned routine you can use; these exist in most stats packages. Give us a little more detail on the environment you want and we might be able to point you more closely to something suitable.
This is most likely not going to work but Artificial Neural Networks may, and i repeat may be able to get something out of a good data set. By good, i mean like weeks or months of continuous recording, and even then i wouldn't trust the data set unless i had very good reason to. I also don't think you'll get predictive data out of it, but it may depend on how you implement it. Overall if you were to do this it would seem to be more of a hobby thing to see if it even even come close, like "oh neat i got a neural network to within X amount of accuracy". Again, i must stress, don't use this in any sort of production situations or anywhere where it could possibly hurt or kill someone!

What are good examples of genetic algorithms/genetic programming solutions? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 11 years ago.
Genetic algorithms (GA) and genetic programming (GP) are interesting areas of research.
I'd like to know about specific problems you have solved using GA/GP and what libraries/frameworks you used if you didn't roll your own.
Questions:
What problems have you used GA/GP to solve?
What libraries/frameworks did you use?
I'm looking for first-hand experiences, so please do not answer unless you have that.
Not homework.
My first job as a professional programmer (1995) was writing a genetic-algorithm based automated trading system for S&P500 futures. The application was written in Visual Basic 3 [!] and I have no idea how I did anything back then, since VB3 didn't even have classes.
The application started with a population of randomly-generated fixed-length strings (the "gene" part), each of which corresponded to a specific shape in the minute-by-minute price data of the S&P500 futures, as well as a specific order (buy or sell) and stop-loss and stop-profit amounts. Each string (or "gene") had its profit performance evaluated by a run through 3 years of historical data; whenever the specified "shape" matched the historical data, I assumed the corresponding buy or sell order and evaluated the trade's result. I added the caveat that each gene started with a fixed amount of money and could thus potentially go broke and be removed from the gene pool entirely.
After each evaluation of a population, the survivors were cross-bred randomly (by just mixing bits from two parents), with the likelihood of a gene being selected as a parent being proportional to the profit it produced. I also added the possibility of point mutations to spice things up a bit. After a few hundred generations of this, I ended up with a population of genes that could turn $5000 into an average of about $10000 with no chance of death/brokeness (on the historical data, of course).
Unfortunately, I never got the chance to use this system live, since my boss lost close to $100,000 in less than 3 months trading the traditional way, and he lost his willingness to continue with the project. In retrospect, I think the system would have made huge profits - not because I was necessarily doing anything right, but because the population of genes that I produced happened to be biased towards buy orders (as opposed to sell orders) by about a 5:1 ratio. And as we know with our 20/20 hindsight, the market went up a bit after 1995.
I made a little critters that lived in this little world. They had a neural network brain which received some inputs from the world and the output was a vector for movement among other actions. Their brains were the "genes".
The program started with a random population of critters with random brains. The inputs and output neurons were static but what was in between was not.
The environment contained food and dangers. Food increased energy and when you have enough energy, you can mate. The dangers would reduce energy and if energy was 0, they died.
Eventually the creatures evolved to move around the world and find food and avoid the dangers.
I then decided to do a little experiment. I gave the creature brains an output neuron called "mouth" and an input neuron called "ear". Started over and was surprised to find that they evolved to maximize the space and each respective creature would stay in its respective part (food was placed randomly). They learned to cooperate with each other and not get in each others way. There were always the exceptions.
Then i tried something interesting. I dead creatures would become food. Try to guess what happened! Two types of creatures evolved, ones that attacked like in swarms, and ones that were high avoidance.
So what is the lesson here? Communication means cooperation. As soon as you introduce an element where hurting another means you gain something, then cooperation is destroyed.
I wonder how this reflects on the system of free markets and capitalism. I mean, if businesses can hurt their competition and get away with it, then its clear they will do everything in their power to hurt the competition.
Edit:
I wrote it in C++ using no frameworks. Wrote my own neural net and GA code. Eric, thank you for saying it is plausible. People usually don't believe in the powers of GA (although the limitations are obvious) until they played with it. GA is simple but not simplistic.
For the doubters, neural nets have been proven to be able to simulate any function if they have more than one layer. GA is a pretty simple way to navigate a solution space finding local and potentially global minimum. Combine GA with neural nets and you have a pretty good way to find functions that find approximate solutions for generic problems. Because we are using neural nets, then we are optimizing the function for some inputs, not some inputs to a function as others are using GA
Here is the demo code for the survival example: http://www.mempko.com/darcs/neural/demos/eaters/
Build instructions:
Install darcs, libboost, liballegro, gcc, cmake, make
darcs clone --lazy http://www.mempko.com/darcs/neural/
cd neural
cmake .
make
cd demos/eaters
./eaters
In January 2004, I was contacted by Philips New Display Technologies who were creating the electronics for the first ever commercial e-ink, the Sony Librie, who had only been released in Japan, years before Amazon Kindle and the others hit the market in US an Europe.
The Philips engineers had a major problem. A few months before the product was supposed to hit the market, they were still getting ghosting on the screen when changing pages. The problem was the 200 drivers that were creating the electrostatic field. Each of these drivers had a certain voltage that had to be set right between zero and 1000 mV or something like this. But if you changed one of them, it would change everything.
So optimizing each driver's voltage individually was out of the question. The number of possible combination of values was in billions,and it took about 1 minute for a special camera to evaluate a single combination. The engineers had tried many standard optimization techniques, but nothing would come close.
The head engineer contacted me because I had previously released a Genetic Programming library to the open-source community. He asked if GP/GA's would help and if I could get involved. I did, and for about a month we worked together, me writing and tuning the GA library, on synthetic data, and him integrating it into their system. Then, one weekend they let it run live with the real thing.
The following Monday I got these glowing emails from him and their hardware designer, about how nobody could believe the amazing results the GA found. This was it. Later that year the product hit the market.
I didn't get paid one cent for it, but I got 'bragging' rights. They said from the beginning they were already over budget, so I knew what the deal was before I started working on it. And it's a great story for applications of GAs. :)
I used a GA to optimize seating assignments at my wedding reception. 80 guests over 10 tables. Evaluation function was based on keeping people with their dates, putting people with something in common together, and keeping people with extreme opposite views at separate tables.
I ran it several times. Each time, I got nine good tables, and one with all the odd balls. In the end, my wife did the seating assignments.
My traveling salesman optimizer used a novel mapping of chromosome to itinerary, which made it trivial to breed and mutate the chromosomes without any risk of generating invalid tours.
Update: Because a couple people have asked how ...
Start with an array of guests (or cities) in some arbitrary but consistent ordering, e.g., alphabetized. Call this the reference solution. Think of a guest's index as his/her seat number.
Instead of trying to encode this ordering directly in the chromosome, we encode instructions for transforming the reference solution into a new solution. Specifically, we treat the chromosomes as lists of indexes in the array to swap. To get decode a chromosome, we start with the reference solution and apply all the swaps indicated by the chromosome. Swapping two entries in the array always results in a valid solution: every guest (or city) still appears exactly once.
Thus chromosomes can be randomly generated, mutated, and crossed with others and will always produce a valid solution.
I used genetic algorithms (as well as some related techniques) to determine the best settings for a risk management system that tried to keep gold farmers from using stolen credit cards to pay for MMOs. The system would take in several thousand transactions with "known" values (fraud or not) and figure out what the best combination of settings was to properly identify the fraudulent transactions without having too many false positives.
We had data on several dozen (boolean) characteristics of a transaction, each of which was given a value and totalled up. If the total was higher than a threshold, the transaction was fraud. The GA would create a large number of random sets of values, evaluate them against a corpus of known data, select the ones that scored the best (on both fraud detection and limiting the number of false positives), then cross breed the best few from each generation to produce a new generation of candidates. After a certain number of generations the best scoring set of values was deemed the winner.
Creating the corpus of known data to test against was the Achilles' heel of the system. If you waited for chargebacks, you were several months behind when trying to respond to the fraudsters, so someone would have to manually review large numbers of transactions to build up that corpus of data without having to wait too long.
This ended up identifying the vast majority of the fraud that came in, but couldn't quite get it below 1% on the most fraud-prone items (given that 90% of incoming transactions could be fraud, that was doing pretty well).
I did all this using perl. One run of the software on a fairly old linux box would take 1-2 hours to run (20 minutes to load data over a WAN link, the rest of the time spent crunching). The size of any given generation was limited by available RAM. I'd run it over and over with slight changes to the parameters, looking for an especially good result set.
All in all it avoided some of the gaffes that came with manually trying to tweak the relative values of dozens of fraud indicators, and consistently came up with better solutions than I could create by hand. AFAIK, it's still in use (about 3 years after I wrote it).
Football Tipping. I built a GA system to predict the week to week outcome of games in the AFL (Aussie Rules Football).
A few years ago I got bored of the standard work football pool, everybody was just going online and taking the picks from some pundit in the press. So, I figured it couldn't be too hard to beat a bunch of broadcast journalism majors, right? My first thought was to take the results from Massey Ratings and then reveal at the end of the season my strategy after winning fame and glory. However, for reasons I've never discovered Massey does not track AFL. The cynic in me believes it is because the outcome of each AFL game has basically become random chance, but my complaints of recent rule changes belong in a different forum.
The system basically considered offensive strength, defensive strength, home field advantage, week to week improvement (or lack thereof) and velocity of changes to each of these. This created a set of polynomial equations for each team over the season. The winner and score for each match for a given date could be computed. The goal was to find the set of coefficients that most closely matched the outcome of all past games and use that set to predict the upcoming weeks game.
In practice, the system would find solutions that accurately predicted over 90% of past game outcomes. It would then successfully pick about 60-80% of games for the upcoming week (that is the week not in the training set).
The result: just above middle of the pack. No major cash prize nor a system that I could use to beat Vegas. It was fun though.
I built everything from scratch, no framework used.
As well as some of the common problems, like the Travelling Salesman and a variation on Roger Alsing's Mona Lisa program, I've also written an evolutionary Sudoku solver (which required a bit more original thought on my part, rather than just re-implementing somebody else's idea). There are more reliable algorithms for solving Sudokus but the evolutionary approach works fairly well.
In the last few days I've been playing around with an evolutionary program to find "cold decks" for poker after seeing this article on Reddit. It's not quite satisfactory at the moment but I think I can improve it.
I have my own framework that I use for evolutionary algorithms.
I developed a home brew GA for a 3D laser surface profile system my company developed for the freight industry back in 1992.
The system relied upon 3 dimensional triangulation and used a custom laser line scanner, a 512x512 camera (with custom capture hw). The distance between the camera and laser was never going to be precise and the focal point of the cameras were not to be found in the 256,256 position that you expected it to be!
It was a nightmare to try and work out the calibration parameters using standard geometry and simulated annealing style equation solving.
The Genetic algorithm was whipped up in an evening and I created a calibration cube to test it on. I knew the cube dimensions to high accuracy and thus the idea was that my GA could evolve a set of custom triangulation parameters for each scanning unit that would overcome production variations.
The trick worked a treat. I was flabbergasted to say the least! Within around 10 generations my 'virtual' cube (generated from the raw scan and recreated from the calibration parameters) actually looked like a cube! After around 50 generations I had the calibration I needed.
Its often difficult to get an exact color combination when you are planning to paint your house. Often, you have some color in mind, but it is not one of the colors, the vendor shows you.
Yesterday, my Prof. who is a GA researcher mentioned about a true story in Germany (sorry, I have no further references, yes, I can find it out if any one requests to). This guy (let's call him the color guy) used to go from door-door to help people to find the exact color code (in RGB) that would be the closet to what the customer had in mind. Here is how he would do it:
The color guy used to carry with him a software program which used GA. He used to start with 4 different colors- each coded as a coded Chromosome (whose decoded value would be a RGB value). The consumer picks 1 of the 4 colors (Which is the closest to which he/she has in mind). The program would then assign the maximum fitness to that individual and move onto the next generation using mutation/crossover. The above steps would be repeated till the consumer had found the exact color and then color guy used to tell him the RGB combination!
By assigning maximum fitness to the color closes to what the consumer have in mind, the color guy's program is increasing the chances to converge to the color, the consumer has in mind exactly. I found it pretty fun!
Now that I have got a -1, if you are planning for more -1's, pls. elucidate the reason for doing so!
A couple of weeks ago, I suggested a solution on SO using genetic algorithms to solve a problem of graph layout. It is an example of a constrained optimization problem.
Also in the area of machine learning, I implemented a GA-based classification rules framework in c/c++ from scratch.
I've also used GA in a sample project for training artificial neural networks (ANN) as opposed to using the famous backpropagation algorithm.
In addition, and as part of my graduate research, I've used GA in training Hidden Markov Models as an additional approach to the EM-based Baum-Welch algorithm (in c/c++ again).
As part of my undergraduate CompSci degree, we were assigned the problem of finding optimal jvm flags for the Jikes research virtual machine. This was evaluated using the Dicappo benchmark suite which returns a time to the console. I wrote a distributed gentic alogirthm that switched these flags to improve the runtime of the benchmark suite, although it took days to run to compensate for hardware jitter affecting the results. The only problem was I didn't properly learn about the compiler theory (which was the intent of the assignment).
I could have seeded the initial population with the exisiting default flags, but what was interesting was that the algorithm found a very similar configuration to the O3 optimisation level (but was actually faster in many tests).
Edit: Also I wrote my own genetic algorithm framework in Python for the assignment, and just used the popen commands to run the various benchmarks, although if it wasn't an assessed assignment I would have looked at pyEvolve.
First off, "Genetic Programming" by Jonathan Koza (on amazon) is pretty much THE book on genetic and evolutionary algorithm/programming techniques, with many examples. I highly suggest checking it out.
As for my own use of a genetic algorithm, I used a (home grown) genetic algorithm to evolve a swarm algorithm for an object collection/destruction scenario (practical purpose could have been clearing a minefield). Here is a link to the paper. The most interesting part of what I did was the multi-staged fitness function, which was a necessity since the simple fitness functions did not provide enough information for the genetic algorithm to sufficiently differentiate between members of the population.
I am part of a team investigating the use of Evolutionary Computation (EC) to automatically fix bugs in existing programs. We have successfully repaired a number of real bugs in real world software projects (see this project's homepage).
We have two applications of this EC repair technique.
The first (code and reproduction information available through the project page) evolves the abstract syntax trees parsed from existing C programs and is implemented in Ocaml using our own custom EC engine.
The second (code and reproduction information available through the project page), my personal contribution to the project, evolves the x86 assembly or Java byte code compiled from programs written in a number of programming languages. This application is implemented in Clojure and also uses its own custom built EC engine.
One nice aspect of Evolutionary Computation is the simplicity of the technique makes it possible to write your own custom implementations without too much difficulty. For a good freely available introductory text on Genetic Programming see the Field Guide to Genetic Programming.
A coworker and I are working on a solution for loading freight onto trucks using the various criteria our company requires. I've been working on a Genetic Algorithm solution while he is using a Branch And Bound with aggressive pruning. We are still in the process of implementing this solution but so far, we have been getting good results.
Several years ago I used ga's to optimize asr (automatic speech recognition) grammars for better recognition rates. I started with fairly simple lists of choices (where the ga was testing combinations of possible terms for each slot) and worked my way up to more open and complex grammars. Fitness was determined by measuring separation between terms/sequences under a kind of phonetic distance function. I also experimented with making weakly equivalent variations on a grammar to find one that compiled to a more compact representation (in the end I went with a direct algorithm, and it drastically increased the size of the "language" that we could use in applications).
More recently I have used them as a default hypothesis against which to test the quality of solutions generated from various algorithms. This has largely involved categorization and different kinds of fitting problems (i.e. create a "rule" that explains a set of choices made by reviewers over a dataset(s)).
I made a complete GA framework named "GALAB", to solve many problems:
locating GSM ANTs (BTS) to decrease overlap & blank locations.
Resource constraint project scheduling.
Evolutionary picture creation. (Evopic)
Travelling salesman problem.
N-Queen & N-Color problems.
Knight's tour & Knapsack problems.
Magic square & Sudoku puzzles.
string compression, based on Superstring problem.
2D Packaging problem.
Tiny artificial life APP.
Rubik puzzle.
I once used a GA to optimize a hash function for memory addresses. The addresses were 4K or 8K page sizes, so they showed some predictability in the bit pattern of the address (least significant bits all zero; middle bits incrementing regularly, etc.) The original hash function was "chunky" - it tended to cluster hits on every third hash bucket. The improved algorithm had a nearly perfect distribution.
I built a simple GA for extracting useful patterns out of the frequency spectrum of music as it was being played. The output was used to drive graphical effects in a winamp plugin.
Input: a few FFT frames (imagine a 2D array of floats)
Output: single float value (weighted sum of inputs), thresholded to 0.0 or 1.0
Genes: input weights
Fitness function: combination of duty cycle, pulse width and BPM within sensible range.
I had a few GAs tuned to different parts of the spectrum as well as different BPM limits, so they didn't tend to converge towards the same pattern. The outputs from the top 4 from each population were sent to the rendering engine.
An interesting side effect was that the average fitness across the population was a good indicator for changes in the music, although it generally took 4-5 seconds to figure it out.
I don't know if homework counts...
During my studies we rolled our own program to solve the Traveling Salesman problem.
The idea was to make a comparison on several criteria (difficulty to map the problem, performance, etc) and we also used other techniques such as Simulated annealing.
It worked pretty well, but it took us a while to understand how to do the 'reproduction' phase correctly: modeling the problem at hand into something suitable for Genetic programming really struck me as the hardest part...
It was an interesting course since we also dabbled with neural networks and the like.
I'd like to know if anyone used this kind of programming in 'production' code.
I used a simple genetic algorithm to optimize the signal to noise ratio of a wave that was represented as a binary string. By flipping the the bits certain ways over several million generations I was able to produce a transform that resulted in a higher signal to noise ratio of that wave. The algorithm could have also been "Simulated Annealing" but was not used in this case. At their core, genetic algorithms are simple, and this was about as simple of a use case that I have seen, so I didn't use a framework for generation creation and selection - only a random seed and the Signal-to-Noise Ratio function at hand.
As part of my thesis I wrote a generic java framework for the multi-objective optimisation algorithm mPOEMS (Multiobjective prototype optimization with evolved improvement steps), which is a GA using evolutionary concepts. It is generic in a way that all problem-independent parts have been separated from the problem-dependent parts, and an interface is povided to use the framework with only adding the problem-dependent parts. Thus one who wants to use the algorithm does not have to begin from zero, and it facilitates work a lot.
You can find the code here.
The solutions which you can find with this algorithm have been compared in a scientific work with state-of-the-art algorithms SPEA-2 and NSGA, and it has been proven that
the algorithm performes comparable or even better, depending on the metrics you take to measure the performance, and especially depending on the optimization-problem you are looking on.
You can find it here.
Also as part of my thesis and proof of work I applied this framework to the project selection problem found in portfolio management. It is about selecting the projects which add the most value to the company, support most the strategy of the company or support any other arbitrary goal. E.g. selection of a certain number of projects from a specific category, or maximization of project synergies, ...
My thesis which applies this framework to the project selection problem:
http://www.ub.tuwien.ac.at/dipl/2008/AC05038968.pdf
After that I worked in a portfolio management department in one of the fortune 500, where they used a commercial software which also applied a GA to the project selection problem / portfolio optimization.
Further resources:
The documentation of the framework:
http://thomaskremmel.com/mpoems/mpoems_in_java_documentation.pdf
mPOEMS presentation paper:
http://portal.acm.org/citation.cfm?id=1792634.1792653
Actually with a bit of enthusiasm everybody could easily adapt the code of the generic framework to an arbitrary multi-objective optimisation problem.
At work I had the following problem: given M tasks and N DSPs, what was the best way to assign tasks to DSPs? "Best" was defined as "minimizing the load of the most loaded DSP". There were different types of tasks, and various task types had various performance ramifications depending on where they were assigned, so I encoded the set of job-to-DSP assignments as a "DNA string" and then used a genetic algorithm to "breed" the best assignment string I could.
It worked fairly well (much better than my previous method, which was to evaluate every possible combination... on non-trivial problem sizes, it would have taken years to complete!), the only problem was that there was no way to tell if the optimal solution had been reached or not. You could only decide if the current "best effort" was good enough, or let it run longer to see if it could do better.
There was an competition on codechef.com (great site by the way, monthly programming competitions) where one was supposed to solve an unsolveable sudoku (one should come as close as possible with as few wrong collumns/rows/etc as possible).What I would do, was to first generate a perfect sudoku and then override the fields, that have been given. From this pretty good basis on I used genetic programming to improve my solution.I couldn't think of a deterministic approach in this case, because the sudoku was 300x300 and search would've taken too long.
In a seminar in the school, we develop an application to generate music based in the musical mode. The program was build in Java and the output was a midi file with the song. We using distincts aproachs of GA to generate the music. I think this program can be useful to explore new compositions.
in undergrad, we used NERO (a combination of neural network and genetic algorithm) to teach in-game robots to make intelligent decisions. It was pretty cool.
I developed a multithreaded swing based simulation of robot navigation through a set of randomized grid terrain of food sources and mines and developed a genetic algorithm based strategy of exploring the optimization of robotic behavior and survival of fittest genes for a robotic chromosome. This was done using charting and mapping of each iteration cycle.
Since, then I have developed even more game behavior. An example application I built recently for myself was a genetic algorithm for solving the traveling sales man problem in route finding in UK taking into account start and goal states as well as one/multiple connection points, delays, cancellations, construction works, rush hour, public strikes, consideration between fastest vs cheapest routes. Then providing a balanced recommendation for the route to take on a given day.
Generally, my strategy is to use POJO based representaton of genes then I apply specific interface implementations for selection, mutation, crossover strategies, and the criteria point. My fitness function then basically becomes a quite complex based on the strategy and criteria I need to apply as a heuristic measure.
I have also looked into applying genetic algorithm into automated testing within code using systematic mutation cycles where the algorithm understands the logic and tries to ascertain a bug report with recommendations for code fixes. Basically, a way to optimize my code and provide recommendations for improvement as well as a way of automating the discovery of new programmatic code. I have also tried to apply genetic algorithms to music production amongst other applications.
Generally, I find evolutionary strategies like most metaheuristic/global optimization strategies, they are slow to learn at first but start to pick up as the solutions become closer and closer to goal state and as long as your fitness function and heuristics are well aligned to produce that convergence within your search space.
I once tried to make a computer player for the game of Go, exclusively based on genetic programming. Each program would be treated as an evaluation function for a sequence of moves. The programs produced weren't very good though, even on a rather diminuitive 3x4 board.
I used Perl, and coded everything myself. I would do things differently today.
After reading The Blind Watchmaker, I was interested in the pascal program Dawkins said he had developed to create models of organisms that could evolve over time. I was interested enough to write my own using Swarm. I didn't make all the fancy critter graphics he did, but my 'chromosomes' controlled traits which affected organisms ability to survive. They lived in a simple world and could slug it out against each other and their environment.
Organisms lived or died partly due to chance, but also based on how effectively they adapted to their local environments, how well they consumed nutrients & how successfully they reproduced. It was fun, but also more proof to my wife that I am a geek.
It was a while ago, but I rolled a GA to evolve what were in effect image processing kernels to remove cosmic ray traces from Hubble Space Telescope (HST) images. The standard approach is to take multiple exposures with the Hubble and keep only the stuff that is the same in all the images. Since HST time is so valuable, I'm an astronomy buff, and had recently attended the Congress on Evolutionary Computation, I thought about using a GA to clean up single exposures.
The individuals were in the form of trees that took a 3x3 pixel area as input, performed some calculations, and produced a decision about whether and how to modify the center pixel. Fitness was judged by comparing the output with an image cleaned up in the traditional way (i.e. stacking exposures).
It actually sort of worked, but not well enough to warrant foregoing the original approach. If I hadn't been time-constrained by my thesis, I might have expanded the genetic parts bin available to the algorithm. I'm pretty sure I could have improved it significantly.
Libraries used: If I recall correctly, IRAF and cfitsio for astronomical image data processing and I/O.
I experimented with GA in my youth. I wrote a simulator in Python that worked as follows.
The genes encoded the weights of a neural network.
The neural network's inputs were "antennae" that detected touches. Higher values meant very close and 0 meant not touching.
The outputs were to two "wheels". If both wheels went forward, the guy went forward. If the wheels were in opposite directions, the guy turned. The strength of the output determined the speed of the wheel turning.
A simple maze was generated. It was really simple--stupid even. There was the start at the bottom of the screen and a goal at the top, with four walls in between. Each wall had a space taken out randomly, so there was always a path.
I started random guys (I thought of them as bugs) at the start. As soon as one guy reached the goal, or a time limit was reached, the fitness was calculated. It was inversely proportional to the distance to the goal at that time.
I then paired them off and "bred" them to create the next generation. The probability of being chosen to be bred was proportional to its fitness. Sometimes this meant that one was bred with itself repeatedly if it had a very high relative fitness.
I thought they would develop a "left wall hugging" behavior, but they always seemed to follow something less optimal. In every experiment, the bugs converged to a spiral pattern. They would spiral outward until they touched a wall to the right. They'd follow that, then when they got to the gap, they'd spiral down (away from the gap) and around. They would make a 270 degree turn to the left, then usually enter the gap. This would get them through a majority of the walls, and often to the goal.
One feature I added was to put in a color vector into the genes to track relatedness between individuals. After a few generations, they'd all be the same color, which tell me I should have a better breeding strategy.
I tried to get them to develop a better strategy. I complicated the neural net--adding a memory and everything. It didn't help. I always saw the same strategy.
I tried various things like having separate gene pools that only recombined after 100 generations. But nothing would push them to a better strategy. Maybe it was impossible.
Another interesting thing is graphing the fitness over time. There were definite patterns, like the maximum fitness going down before it would go up. I have never seen an evolution book talk about that possibility.

How to get scientific results from non-experimental data (datamining?)

I want to obtain maximum performance out of a process with many variables, many of which cannot be controlled.
I cannot run thousands of experiments, so it'd be nice if I could run hundreds of experiments and
vary many controllable parameters
collect data on many parameters indicating performance
'correct,' as much as possible, for those parameters I couldn't control
Tease out the 'best' values for those things I can control, and start all over again
It feels like this would be called data mining, where you're going through tons of data which doesn't immediately appear to relate, but does show correlation after some effort.
So... Where do I start looking at algorithms, concepts, theory of this sort of thing? Even related terms for purposes of search would be useful.
Background: I like to do ultra-marathon cycling, and keep logs of each ride. I'd like to keep more data, and after hundreds of rides be able to pull out information about how I perform.
However, everything varies - routes, environment (temp, pres., hum., sun load, wind, precip., etc), fuel, attitude, weight, water load, etc, etc, etc. I can control a few things, but running the same route 20 times to test out a new fuel regime would just be depressing, and take years to perform all the experiments that I'd like to do. I can, however, record all these things and more(telemetry on bicycle FTW).
It sounds like you want to do some regression analysis. You certainly have plenty of data!
Regression analysis is an extremely common modeling technique in statistics and science. (It could be argued that statistics is the art and science of regression analysis.) There are many statistics packages out there to do the computation you'll need. (I'd recommend one, but I'm years out of date.)
Data mining has gotten a bad name because far too often people assume correlation equals causation. I found that a good technique is to start with variables you know have an influence and build a statistical model around them first. So you know that wind, weight and climb have an influence on how fast you can travel and statistical software can take your dataset and calculate what the correlation between those factors are. That will give you a statistical model or linear equation:
speed = x*weight + y*wind + z*climb + constant
When you explore new variables, you will be able to see if the model is improved or not by comparing a goodness of fit metric like R-squared. So you might check if temperature or time of day adds anything to the model.
You may want to apply a transformation to you data. For instance, you might find that you perform better on colder days. But really cold days and really hot days might hurt performance. In that case, you could assign temperatures to bins or segments: < 0°C; 0°C to 40°C; > 40°C, or some such. The key is to transform the data in a way that matches a rational model of what is going on in the real world, not just the data itself.
In case someone thinks this is not a programming related topic, notice that you can use these same techniques to analyze system performance.
With that many variables you have too many dimensions and you may want to look at Principal Component Analysis. It takes some of the "art" out of regression analysis and lets the data speak for itself. Some software to do that sort of analysis is shown at the bottom of the link.
I have used the Perl module Statistics::Regression for somewhat similar problems in the past. Be warned, however, that regression analysis is definitely an art. As the warning in the Perl module says, it won't make sense to you if you haven't learned the appropriate math.

Resources