How to get scientific results from non-experimental data (data mining?) - algorithm

I want to obtain maximum performance out of a process with many variables, many of which cannot be controlled.
I cannot run thousands of experiments, so it'd be nice if I could run hundreds of experiments and:
- vary many controllable parameters,
- collect data on many parameters indicating performance,
- 'correct', as much as possible, for those parameters I couldn't control,
- tease out the 'best' values for the things I can control, and start all over again.
It feels like this would be called data mining, where you're going through tons of data which doesn't immediately appear to relate, but does show correlation after some effort.
So... Where do I start looking at algorithms, concepts, theory of this sort of thing? Even related terms for purposes of search would be useful.
Background: I like to do ultra-marathon cycling, and keep logs of each ride. I'd like to keep more data, and after hundreds of rides be able to pull out information about how I perform.
However, everything varies - routes, environment (temp, pres., hum., sun load, wind, precip., etc.), fuel, attitude, weight, water load, etc., etc. I can control a few things, but riding the same route 20 times to test out a new fuel regime would just be depressing, and it would take years to perform all the experiments that I'd like to do. I can, however, record all these things and more (telemetry on the bicycle, FTW).

It sounds like you want to do some regression analysis. You certainly have plenty of data!
Regression analysis is an extremely common modeling technique in statistics and science. (It could be argued that statistics is the art and science of regression analysis.) There are many statistics packages out there to do the computation you'll need. (I'd recommend one, but I'm years out of date.)
Data mining has gotten a bad name because far too often people assume correlation equals causation. I've found that a good technique is to start with variables you know have an influence and build a statistical model around them first. You know that wind, weight and climb influence how fast you can travel, and statistical software can take your dataset and calculate how strongly those factors correlate with speed. That will give you a statistical model, or linear equation:
speed = x*weight + y*wind + z*climb + constant
When you explore new variables, you will be able to see if the model is improved or not by comparing a goodness of fit metric like R-squared. So you might check if temperature or time of day adds anything to the model.
You may want to apply a transformation to your data. For instance, you might find that you perform better on colder days, but that really cold days and really hot days both hurt performance. In that case, you could assign temperatures to bins or segments: < 0°C; 0°C to 40°C; > 40°C, or some such. The key is to transform the data in a way that matches a rational model of what is going on in the real world, not just the data itself.
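If it helps to see the mechanics, here is a minimal sketch in Python using pandas and statsmodels. The file name and column names (speed, weight, wind, climb, temp) are made up for illustration; substitute whatever your ride log actually records.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical ride log, one row per ride; columns: speed, weight, wind, climb, temp.
    rides = pd.read_csv("ride_log.csv")

    # Baseline model with the factors you already know matter.
    base = smf.ols("speed ~ weight + wind + climb", data=rides).fit()

    # Candidate model with temperature binned into cold / moderate / hot.
    rides["temp_bin"] = pd.cut(rides["temp"], bins=[-40, 0, 40, 60],
                               labels=["cold", "moderate", "hot"])
    extended = smf.ols("speed ~ weight + wind + climb + C(temp_bin)", data=rides).fit()

    # If adjusted R-squared barely moves, the new variable probably isn't adding much.
    print(base.rsquared_adj, extended.rsquared_adj)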
In case someone thinks this is not a programming related topic, notice that you can use these same techniques to analyze system performance.

With that many variables you have too many dimensions and you may want to look at Principal Component Analysis. It takes some of the "art" out of regression analysis and lets the data speak for itself. Some software to do that sort of analysis is shown at the bottom of the link.
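As a rough illustration of what that looks like in practice, here is a hedged sketch with scikit-learn; the input file and the 95% variance threshold are arbitrary choices for the example.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # X is a hypothetical (n_rides x n_variables) array of ride measurements.
    X = np.loadtxt("ride_features.csv", delimiter=",")

    # PCA is scale-sensitive, so standardize each column first.
    X_std = StandardScaler().fit_transform(X)

    pca = PCA(n_components=0.95)   # keep enough components for 95% of the variance
    scores = pca.fit_transform(X_std)

    # How much of the variation each retained component explains.
    print(pca.explained_variance_ratio_)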

I have used the Perl module Statistics::Regression for somewhat similar problems in the past. Be warned, however, that regression analysis is definitely an art. As the warning in the Perl module says, it won't make sense to you if you haven't learned the appropriate math.

Related

What type of algorithm should I use for forecasting with only very little historic data?

The problem is as follows:
I want to use a forecasting algorithm to predict the heat demand of an otherwise unspecified household over the next 24 hours, with a time resolution of a few minutes for the next three or four hours and a lower resolution for the following hours.
The algorithm should be adaptive and learn over time. I do not have much historic data, since I want the algorithm to be usable in different settings from the start. I only have very basic input to begin with, such as the assumed yearly heat demand, the current outside temperature, and the time. So it will be quite general and imprecise at the beginning, but it should learn from its errors over time.
The algorithm should be implemented in MATLAB, if possible.
Does anyone know an approach or an algorithm designed to predict sensible values after a short time by learning from and adapting to incoming data?
Well, this question is quite broad, as essentially any algorithm for forecasting or data assimilation could do this task in principle.
The classic approach I would look into first is Kalman filtering, which is a quite general approach, at least once its generalizations to ensemble filters etc. are taken into account (it is also easy to implement in MATLAB).
https://en.wikipedia.org/wiki/Kalman_filter
However, more important than the actual inference algorithm is typically the design of the model you fit to your data. For your scenario you could start with a simple prediction from past values and then add daily rhythms, the influence of outside temperature, and so on. The more (correct) information you put into your model a priori, the better it should be at prediction.
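To make the idea concrete, here is a toy one-dimensional Kalman filter tracking heat demand, written in Python for brevity; the same few lines translate almost directly to MATLAB. All noise variances and measurements are invented.

    # Toy 1-D Kalman filter tracking heat demand (kW); every number is made up.
    q = 0.1                     # process noise variance (how fast true demand drifts)
    r = 1.0                     # measurement noise variance (how noisy readings are)
    x_est, p_est = 5.0, 10.0    # initial guess and its uncertainty

    def kalman_step(x_est, p_est, z):
        # Predict: demand is assumed to persist, uncertainty grows.
        x_pred, p_pred = x_est, p_est + q
        # Update: blend the prediction with the new measurement z.
        k = p_pred / (p_pred + r)          # Kalman gain
        x_new = x_pred + k * (z - x_pred)
        p_new = (1 - k) * p_pred
        return x_new, p_new

    for z in [4.8, 5.3, 6.1, 5.9, 6.4]:    # incoming demand measurements
        x_est, p_est = kalman_step(x_est, p_est, z)
        print(round(x_est, 2))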
For the full mathematical analysis of this type of problem I can recommend this book: https://doi.org/10.1017/CBO9781107706804
In order to turn this into a calibration problem, we need:
a model that predicts the heat demand depending on inputs and parameters,
observations of the heat demand.
Calibrating this model means tuning the parameters so that the model best predicts the heat demand.
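Before reaching for a dedicated toolbox, the calibration step itself can be sketched with plain least squares in SciPy. The demand model, parameter names, and observations below are invented purely for illustration.

    import numpy as np
    from scipy.optimize import curve_fit

    # Invented toy model: demand rises linearly as the outside temperature drops
    # below a base temperature, plus a constant baseline.
    def demand_model(t_outside, base_temp, slope, baseline):
        return baseline + slope * np.clip(base_temp - t_outside, 0, None)

    # Hypothetical observations: outside temperature (°C) and measured demand (kW).
    t_obs = np.array([-5.0, 0.0, 5.0, 10.0, 15.0, 20.0])
    d_obs = np.array([12.1, 9.8, 7.2, 4.9, 2.4, 1.1])

    # Calibration = tuning (base_temp, slope, baseline) so the model best fits the data.
    params, cov = curve_fit(demand_model, t_obs, d_obs, p0=[18.0, 0.5, 1.0])
    print(params)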
If you go for Python, I suggest using OpenTURNS, which provides several data assimilation and calibration methods, e.g. Kalman filtering (also called BLUE):
https://openturns.github.io/openturns/latest/user_manual/calibration.html

What is the 'predictive' element of machine learning

I'm hoping someone with a lot more knowledge of machine learning can help me out here. I've been reading examples of regression and classification and I always seem to come back to the question 'what is really the difference between what this algorithm is doing and what standard statistical analysis would do'.
Specifically, none of the examples I read seem to discuss the predictive element. For example, when looking at linear regression the articles commonly explain the concept of trying to create a 'best fit' - positing a linear equation and then iterating on a cost function until it reaches a minimum. Of course, throughout, a lot of emphasis is put on a 'training data set'. No problem... but this is usually where it ends. At this point I can't see the difference between the above and the standard way in which one would carry out statistical analysis on a data set that was assumed to have a linear relationship. Presumably, future values here are 'predicted' from the equation that was produced when the cost function converged on a minimum - again, there doesn't seem to be much 'learning' here, as this is exactly what would be done in the usual case.
After a long winded intro... what I'm trying to ask is how has the algorithm learned from the original training data? and how does this training set help with future data sets? (again, this is where I get a bit lost - to me it seems that you would give it a new data set and carry out the same task of minimising the cost function - however, this time you have a better 'starting' point but all of your knowledge really comes from what you already 'knew' about the dataset i.e that one assumed a linear relationship).
I hope this makes sense - it's clearly a lack of understanding, but I'm hoping someone can shove me in the right direction.
Thanks!
You are right, there is no difference. Linear regression is purely a statistical method, and "fitting" would probably be more accurate than "learning" in this case. But again, this is usually just the first lecture on the subject. There are many approaches where the differences are much clearer, for example SVMs. There are also approaches where the "learning" aspect is much clearer, e.g. using reinforcement learning in games, where you can actually see your system improve its performance with experience.
Anyway, the main subject of machine learning is learning from examples. You are given a list of 100 patients, along with blood pressure, age, cholesterol level, etc., and for each of them you are told whether they have heart disease or not. Then you are given a patient that you had not seen before. Does he have heart disease? Most people call this prediction. You might prefer to call it fitting, or anything else. But the fact is, it usually works quite well.
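A minimal sketch with scikit-learn and made-up patient numbers shows where the "learning" happens: the coefficients are estimated once during fit, and prediction simply applies those fitted coefficients to a patient the model has never seen; no new cost-function minimisation is run for the new case.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Made-up training data: [blood pressure, age, cholesterol] for 6 patients,
    # and whether each was diagnosed with heart disease (1) or not (0).
    X_train = np.array([[120, 35, 180], [160, 60, 260], [130, 45, 200],
                        [170, 70, 280], [110, 30, 170], [150, 55, 240]])
    y_train = np.array([0, 1, 0, 1, 0, 1])

    model = LogisticRegression()
    model.fit(X_train, y_train)          # the "learning": coefficients are estimated here

    # A patient the model has never seen; we only apply the learned coefficients.
    new_patient = np.array([[145, 52, 230]])
    print(model.predict(new_patient))        # predicted class
    print(model.predict_proba(new_patient))  # predicted probability of each class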
Still, the subject remains closely tied to statistics, and indeed, you need to make some assumptions (to a larger or smaller extent, depending on the algorithm) about the underlying function. It is not perfect, but in many cases it's the best thing we have, so I would say it is worth studying. If you are starting now, there is a great online course, Stanford's "Statistical Learning", which deals with the subject from your point of view.

What Machine Learning algorithm would be appropriate?

I am working on a predictor for learning the most likely period for grape harvesting, depending on the weather and on the characteristics of the grapes, namely sugar level, pH and acidity. I've got two datasets and I am thinking of how to merge them: one is the pre-harvest analysis data of some Italian vineyards over the 2003-2013 period, the other is the weather over that decade. What I want to do is learn from my samples when to harvest, given a range for the optimal sugar level, pH and acidity, and given a weather forecast.
I thought that some reinforcement learning approach could work. Since the pre-harvest analyses are done about 5 times during the grape maturation period, I thought that those could be the states I step through, while the weather conditions could be the "probabilities" of going from one state to another.
Yet I am not sure which algorithm would be best, as every state and every "probability" depends on several variables. I was told that a Hidden Markov Model would work, but it seems to me that my problem doesn't fit the model perfectly.
Do you have any suggestion? Thx in advance
This has nothing to do with the actual algorithm, but the problem you are going to run into here is that weather is extremely local. One vineyard can have completely different weather than another only a mile away from it, believe it or not. If you put rain gauges at each vineyard, you will find this out. To get really good results you need to have a mini weather station at each vineyard. Absent this, your best option is to use only vineyards in the immediate vicinity of the weather measurements. For example, if your data is from an airport, only use vineyards right next to the airport.
Reinforcement learning is appropriate when you can control the action. It is like a monkey pushing buttons. You push a button and get shocked, so you don't push that button again. Here you have a passive data set and cannot conduct experimental actions, so reinforcement learning does not apply.
Here you have a complex set of uncontrolled inputs, the weather data, a controlled input (harvest time), and several output parameters, sugar etc. Given that data, you want to predict what harvest time to use for some future, unknown weather pattern.
In general, what you are doing is sensitivity analysis: trying to figure out how your factors affected the outcome that occurred. The tricky part is that the outcomes may be driven by some non-obvious pattern. For example, maybe 3 weeks of drought, followed by 2 weeks of heavy rain implies the best harvest will be 65 days hence, or something like that.
So, what you have to do is featurize the data to characterize it in plausible ways, then do a sensitivity analysis. If the analysis shows a strong correlation, then you have found a solution. If it does not, then you have to find a different way to featurize the data. For example, your featurization might be the number of days with rain over 2 inches, or the longest run of days without rain, or the total number of days with bright sunshine. Possibly multiple features combine to make a solution. The options are limited only by your imagination.
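As a rough sketch of that featurize-then-correlate loop, here is some Python with pandas. The file names, column names, and thresholds are all invented; the point is only the shape of the workflow.

    import pandas as pd

    # Hypothetical inputs: a daily weather table with 'year', 'rain_in' and
    # 'sunshine_hours' columns, plus a per-year outcome table with a 'quality' score.
    daily = pd.read_csv("daily_weather.csv")
    outcomes = pd.read_csv("harvest_quality.csv", index_col="year")

    rows = []
    for year, season in daily.groupby("year"):
        dry = season["rain_in"] == 0
        longest_dry = dry.groupby((~dry).cumsum()).cumsum().max()  # longest dry spell
        rows.append({
            "year": year,
            "days_heavy_rain": int((season["rain_in"] > 2.0).sum()),
            "longest_dry_spell": int(longest_dry),
            "bright_days": int((season["sunshine_hours"] > 8).sum()),
        })

    features = pd.DataFrame(rows).set_index("year")

    # Crude sensitivity check: which engineered feature tracks the outcome?
    print(features.join(outcomes).corr()["quality"])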
Of course, as I was saying above, the fly in the ointment is that your weather data will only roughly approximate the real and actual weather at the particular vineyard, so there will be noise in the data, possibly so much noise as to make getting a good result impossible.
Why you actually don't care too much about the weather
Getting back to the data, having unreliable weather information is actually not a problem, because you don't care too much about the weather. The reason is two-fold. First of all, the question you are trying to answer is not when to harvest the grapes, it is whether to wait to harvest or not. The vintner can always measure the current sugar of the grapes. So he just has to decide: "Should I harvest the grapes now with sugar X%, or should I wait and possibly get a better sugar Z% later?" To answer this question, the real data you need is not the weather, it is a series of sugar/acidity readings taken over time. What you want to predict is whether, given a situation, the grapes will get better or worse.
Secondly, grapevines have an optimal amount of moisture they like. If the vine gets too dry, that is bad; if it gets too wet, that is bad too. You cannot predict how moist a vine is from the weather. Some soils hold moisture well, others are sandy. A sandy vineyard will require more rain than a clay vineyard to reach the same moisture levels. Also, the vintner can water his vineyards, completely invalidating the rainfall pattern. Therefore, weather is pretty much a non-factor.
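Following that line of thought, the decision "harvest now or wait" can be caricatured with a trend fit on recent sugar readings. This is only a toy sketch with invented numbers and an arbitrary threshold, not a recommendation of an actual rule.

    import numpy as np

    # Toy decision sketch: given recent sugar readings (°Brix) over the last weeks,
    # fit a straight line and ask whether sugar is still climbing. Numbers invented.
    days = np.array([0, 4, 8, 12, 16])
    sugar = np.array([19.5, 20.4, 21.6, 22.1, 22.3])

    slope, intercept = np.polyfit(days, sugar, 1)

    if slope > 0.05:             # arbitrary threshold in °Brix per day
        print("Sugar still rising noticeably: consider waiting.")
    else:
        print("Sugar has plateaued: harvesting now looks reasonable.")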
I agree with Tyler that, from a feasibility standpoint, weather might harm your analysis. However, I think this is for you to test and find out! There could be some interesting data that comes out of it.
I'm not sure exactly what your test is, but a simple way to start is perhaps to turn this into a classification problem using an SVM (or even logistic regression, since you want probabilities) and use all the data as input for the algorithm, assuming you know which years were good harvest years and which were not. You could even test each variable individually and see how it affects your performance. I suggest you go this way if you can, just because there are massive amounts of sources on the net and people here on SO who can help you tune your algorithm.
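For example (and only as a sketch, with invented per-year features and labels), a classifier along those lines might look like this in scikit-learn:

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Invented per-year features, e.g. [mean summer temp, total rainfall, mean sugar],
    # with a label for whether that year's harvest was considered good (1) or not (0).
    X = np.array([[24.1, 310, 22.5], [21.3, 520, 19.8], [23.0, 400, 21.2],
                  [25.2, 280, 23.0], [20.8, 560, 19.1], [22.6, 430, 20.7]])
    y = np.array([1, 0, 1, 1, 0, 0])

    # probability=True lets the SVM emit class probabilities; logistic regression is
    # the simpler alternative mentioned above if probabilities are the main interest.
    clf = make_pipeline(StandardScaler(), SVC(probability=True))
    clf.fit(X, y)

    new_year = np.array([[23.5, 350, 21.8]])
    print(clf.predict(new_year), clf.predict_proba(new_year))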
When you have a handle on this, I would, as you seem to have been advised before, try the HMM, as it will tell you which day was probably the best for the harvest. This is where the weather might hurt, but you'll come to understand more about your data from the simpler experiments.
The thing I've learned about machine learning is that while there are guidelines for when to choose which algorithm, it's not always set in stone, and you can change your question slightly and try a new approach to the problem, depending on how much freedom you have to play with the data. Good luck and have fun!

Weather prediction algorithm variety

Currently there's a big 'storm' over the predictions by the Met Office in the UK. They predicted a mild, wet winter, while we have the coldest temperatures on record in Northern Ireland and solid snow on the ground, normally rare in December.
It's something I'd love to have a play with, not that I'm claiming I can beat them, but was wondering what algorithms are out there currently that people are working with? What datasets do they base it on?
Possibilities presumably include neural networks modelling the input with fitness being the accuracy of the prediction, complex mathematical models, or even the 'same as yesterday' prediction, which I've heard claimed (although not seen evidence) is more reliable for single-day prediction (although it obviously drops off after that).
Ideally I'd like to hear from some developers in weather centres, or who get access to the supercomputers; it'd be interesting to hear their approaches...
In short, if you intend to build and run your own forecasting model, you will face three major problems:
Access to observations
Development of a mathematical model
Computational power to run your model
Access to observations
As far as I know, access to good meteorological observations costs a lot of money.
You need to have observations from all over the globe and model the state of oceans and atmosphere for the whole planet. Alternatively, you need to obtain so-called lateral boundary conditions from someone who calculates a global model.
Development of a mathematical model
I'm not, and have never been, affiliated with the Met Office, but I did port and optimize a version of their Unified Model for a supercomputer at our center a couple of years ago. Here's how I remember the model.
The Met Office has been developing their Unified Model for the last 20+ years; we're talking about millions of lines of code that contain state-of-the-art ocean/atmosphere models and numerical algorithms. Check out this section of the (outdated) User Guide for a glimpse of the scientific methods used in their model. It's the fruit of, give or take, half a century of well-funded, extensive research by a large community of smart people. If there were a simple solution that consistently gave better results than the complex models, someone would probably have implemented it by now.
To conclude, I guess it's very hard to get even remotely satisfactory results in weather forecasting by building a model from scratch, unless you have an MSc/PhD in atmospheric physics and a couple of years of free time on your hands.
Computational power to run your model
The first forecasting models were run in the middle of the 20th century on machines that can't match today's cellphones, so, technically, you could calculate something on your PC. However, this type of job is usually done on very, very powerful machines. In fact, 10 systems in the Top500 are dedicated solely to weather forecasting and climate research.
Interesting reads
http://en.wikipedia.org/wiki/Weather_forecasting#How_models_create_forecasts
http://en.wikipedia.org/wiki/Numerical_weather_prediction
http://research.metoffice.gov.uk/research/nwp/numerical/operational/index.html
http://ncas-cms.nerc.ac.uk/html_umdocs/UM55_User_Guide/
UPDATE: It's possible to obtain the source code of the WRF model for free, together with some met data. Note that WRF, the Unified Model, COAMPS, and many other models are written primarily in Fortran.
First off, you can import raw data from http://tgftp.nws.noaa.gov and other weather data sources. The best way for the computer to understand the data is to put it on a map, where each point interacts with its neighbours. The data at each point can represent temperature, pressure, wind speed and direction, cloud coverage, where the sun is in the sky, visibility, and the last 100 hours of precipitation. You could make predictions, then later compare them to the actual outcomes as well as to the Weather Service's forecasts, and update a climate model for each data point. That way, it could be a self-learning neural network. As far as computation power is concerned, get a Titan or a Big Mac!
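As a toy version of that "points on a map" idea, here is a sketch with entirely synthetic data and a small scikit-learn neural network; real grids, variables, and retraining schedules would of course be far more involved.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    # Each sample stands for one grid point at one time step, described by a few
    # local readings; the target is that point's temperature a few hours later.
    rng = np.random.default_rng(0)
    n = 500
    pressure = rng.normal(1013, 8, n)     # hPa
    temp_now = rng.normal(10, 6, n)       # °C
    wind     = rng.uniform(0, 20, n)      # m/s
    cloud    = rng.uniform(0, 1, n)       # fraction of sky covered

    # Fake "future temperature" so the example runs end to end.
    temp_later = temp_now + 0.02 * (pressure - 1013) - 2 * cloud + rng.normal(0, 1, n)

    X = np.column_stack([pressure, temp_now, wind, cloud])
    model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
    model.fit(X, temp_later)

    # Re-fitting as new observations arrive would be the crude "self-learning" loop.
    print(model.predict(X[:3]))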
It seems possible to construct a simple forecast model. My watch features a barometer and a thermometer (the latter not usable at all, because the watch is warmed by the hand). Based solely on those measurements, it has several times warned me of incoming rain (the cloud icon in the upper-left corner), in spite of sunny forecasts from internet sites.
A quick search leads us to the Sager Algorithm, which uses only very simple input data. However, while the implementation claims to be open source, I have failed to locate either the code or any scientific papers on the algorithm.
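For comparison, a single-station heuristic in the spirit of what such watches do (not the Sager Algorithm itself, since its details are hard to find) can be as simple as looking at the three-hour pressure trend; the thresholds below are rough rules of thumb, not calibrated values.

    # A crude single-station heuristic: classify the weather tendency from the
    # change in barometric pressure over roughly the last three hours.
    def pressure_trend_forecast(pressure_3h_ago_hpa: float, pressure_now_hpa: float) -> str:
        change = pressure_now_hpa - pressure_3h_ago_hpa
        if change <= -3.0:
            return "rapid fall: rain or wind likely soon"
        if change <= -1.0:
            return "slow fall: deteriorating weather possible"
        if change >= 1.0:
            return "rising: improving / settled weather"
        return "steady: little change expected"

    print(pressure_trend_forecast(1015.0, 1011.5))   # made-up readings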

using software metrics for measuring productivity of pair programming

What software metrics can be used to measure the performance of pair programming?
To be clear:
Are there any metrics that measure pair programming specifically, rather than the individual programmer? What parameters are used for the measurement?
For example: if we want to measure the cost of both individual and pair programming,
let's assume that for individual programming Cost = x, so for the pair it will be Cost = 2*x,
right?
And the same for time: for an individual Time = t, while for the pair Time = 2*t.
So if I would like to use lines of code to measure the product size, is there any difference between individual and pair programming when using this metric?
Any ideas?
Sorry to spoil your party, but lines of code is one of the worst metrics possible, especially if people know their assessment or bonus is in any way tied to the metric. It actively encourages cut-and-paste programming and other atrocities. It's more effort, but why don't you categorise the workload in terms of expected effort for one person, based on your historical data? Or get some programmers to agree to do a few projects redundantly, rotating between pair programming and individual work, so you can see how the same programmers do at each. As one good programmer can be more productive than two average programmers (I vaguely remember an old IBM study concluding someone in the top percentile was 27x more productive than the median), it's useful to see the same programmers doing it both ways. If objectively discovering the right process through such an experiment is too costly in terms of lost short-term productivity, then you're better off not bothering with the LOC metrics anyway... good programmers who know their work arrangements are being based on such metrics will probably be highly unimpressed.
Remember that there are also intangibles involved... pair programming - IMHO - forces people to keep focused, and to make design decisions that are more rounded and professional. Just the social contact can help relieve boredom, though it may stress some people too. My suspicion is that - whether or not it's faster to begin with - it makes for better, more maintainable results. It also ensures skill and knowledge transfer. You should factor in such intangible aspects as best you can - maybe doing interviews or anonymous surveys with the trial participants.
I guess what you are trying to ask is how to measure the efficiency of a team that uses pair programming. If so, then the answer is that the measurement of efficiency doesn't depend on the method or process of work the team is using. You should try to evaluate the quality of their product releases, with metrics like the number of issues identified post-release. Probably also velocity.
And please don't use lines of code for efficiency measurement. It doesn't make sense. Lines of code is a measure of product size, not of developer efficiency. It's like using height or weight to judge how smart someone is. There is no correlation between the amount of code and individual efficiency.
If you are interested in more software metrics, take a look at http://www.sdlcmetrics.org
