Predicting missing data values in a database - algorithm

I have a database, consisting of a whole bunch of records (around 600,000) where some of the records have certain fields missing. My goal is to find a way to predict what the missing data values should be (so I can fill them in) based on the existing data.
One option I am looking at is clustering - i.e. representing the records that are complete as points in some space, looking for clusters of points, and then, when given a record with missing data values, trying to find out whether there are any clusters it could belong to that are consistent with the existing data values. However, this may not be possible because some of the data fields are on a nominal scale (e.g. color) and thus can't be put in order.
Another idea I had is to create some sort of probabilistic model that would predict the data, train it on the existing data, and then use it to extrapolate.
What algorithms are available for doing the above, and is there any freely available software that implements them? (This software is going to be in C#, by the way.)
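To make the clustering idea a bit more concrete, here's roughly what I'm picturing, sketched in Python with scikit-learn purely for illustration (the final implementation will be C#, and the fields are made up):

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.cluster import KMeans

    # Toy "complete" records: one nominal field (colour) and two numeric fields.
    colours = np.array([["red"], ["blue"], ["red"], ["green"]])
    numerics = np.array([[1.0, 200.0], [2.5, 180.0], [1.2, 210.0], [3.0, 150.0]])

    # A nominal field can't be ordered, but one-hot encoding still lets it
    # contribute coordinates, so every record lives in one common space.
    enc = OneHotEncoder().fit(colours)
    scaler = StandardScaler().fit(numerics)
    colour_block = enc.transform(colours).toarray()
    X = np.hstack([colour_block, scaler.transform(numerics)])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

    # For a record whose colour is missing, pick the cluster whose centroid is
    # closest on the numeric dimensions only, then read the colour off that centroid.
    n_colour_dims = colour_block.shape[1]
    partial = scaler.transform([[1.1, 205.0]])[0]
    numeric_centroids = kmeans.cluster_centers_[:, n_colour_dims:]
    best = np.argmin(np.linalg.norm(numeric_centroids - partial, axis=1))
    predicted = enc.categories_[0][np.argmax(kmeans.cluster_centers_[best, :n_colour_dims])]
    print(predicted)  # predicted colour for the incomplete record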

This is less of an algorithmic question and more of a philosophical and methodological one. There are a few different techniques available to tackle this kind of problem. Acock (2005) gives a good introduction to some of the methods. Although there may seem to be a lot of math/statistics involved (and it may seem like a lot of effort), it's worth thinking about what would happen if you got it wrong.
Andrew Gelman's blog is also a good resource, although the search functionality on his blog leaves something to be desired...
Hope this helps.
Acock (2005): http://oregonstate.edu/~acock/growth-curves/working%20with%20missing%20values.pdf
Andrew Gelman's blog: http://www.stat.columbia.edu/~cook/movabletype/mlm/

Dealing with missing values is a methodological question that has to do with the actual meaning of the data.
Several methods you can use (detailed post on my blog):
Ignore the data row. This is usually done when the class label is missing (assuming your data mining goal is classification), or when many attributes are missing from the row (not just one). However, you'll obviously get poor performance if the percentage of such rows is high.
Use a global constant to fill in for missing values, like "unknown", "N/A" or minus infinity. This is used because sometimes it just doesn't make sense to try and predict the missing value. For example, if you have a DB of, say, college candidates and the state of residence is missing for some, filling it in doesn't make much sense...
Use the attribute mean. For example, if the average income of a US family is X, you can use that value to replace missing income values.
Use the attribute mean for all samples belonging to the same class. Let's say you have a car pricing DB that, among other things, classifies cars as "Luxury" and "Low budget", and you're dealing with missing values in the cost field. Replacing the missing cost of a luxury car with the average cost of all luxury cars is probably more accurate than the value you'd get if you factored in the low budget cars.
Use a data mining algorithm to predict the value. The value can be determined using regression, inference-based tools using a Bayesian formalism, decision trees, or clustering algorithms (k-means/k-median etc.) used to generate input for method #4.
I'd suggest looking into regression and decision trees first (ID3 tree generation) as they're relatively easy and there are plenty of examples on the net.
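To make methods #3-#5 concrete, here's a minimal Python sketch with pandas and scikit-learn (the columns and values are invented for illustration; the same ideas carry over to whatever C# tooling you end up with):

    import pandas as pd
    from sklearn.tree import DecisionTreeRegressor

    cars = pd.DataFrame({
        "segment": ["Luxury", "Luxury", "Low budget", "Low budget", "Luxury"],
        "horsepower": [400, 380, 90, 110, 450],
        "cost": [90000, None, 12000, 14000, None],
    })

    # Method #3: replace a missing cost with the overall mean cost.
    overall_mean = cars["cost"].fillna(cars["cost"].mean())

    # Method #4: replace a missing cost with the mean cost of the car's own class.
    class_mean = cars["cost"].fillna(cars.groupby("segment")["cost"].transform("mean"))

    # Method #5: predict the missing cost from the other attributes
    # (a decision tree here, but any regressor would do).
    known = cars[cars["cost"].notna()]
    missing = cars[cars["cost"].isna()]
    model = DecisionTreeRegressor().fit(known[["horsepower"]], known["cost"])
    cars.loc[cars["cost"].isna(), "cost"] = model.predict(missing[["horsepower"]])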
As for packages, if you can afford it and you're in the Microsoft world, look at SQL Server Analysis Services (SSAS for short), which implements most of the methods mentioned above.
Here are some links to free data mining software packages:
WEKA - http://www.cs.waikato.ac.nz/ml/weka/index.html
ORANGE - http://www.ailab.si/orange
TANAGRA - http://eric.univ-lyon2.fr/~ricco/tanagra/en/tanagra.html
Although not in C#, here's a pretty good intro to decision trees and Bayesian learning (using Ruby):
http://www.igvita.com/2007/04/16/decision-tree-learning-in-ruby/
http://www.igvita.com/2007/05/23/bayes-classification-in-ruby/
There's also this Ruby library that I find very useful (also for learning purposes):
http://ai4r.rubyforge.org/machineLearning.html
There should be plenty of samples for these algorithms online in any language so I'm sure you'll easily find C# stuff too...
Edited:
Forgot this in my original post. This is definitely a must-have if you're playing with data mining...
Download Microsoft SQL Server 2008 Data Mining Add-ins for Microsoft Office 2007 (It requires SQL Server Analysis Services - SSAS - which isn't free but you can download a trial).
This will allow you to easily play and try out the different techniques in Excel before you go and implement this stuff yourself. Then again, since you're in the Microsoft ecosystem, you might even decide to go for an SSAS based solution and count on the SQL Server guys to do it for ya :)

Predicting missing values is generally considered to be part of the data cleansing phase, which needs to be done before the data is mined or analyzed further. Missing values are quite prominent in real-world data.
Please have a look at this algorithm: http://arxiv.org/abs/math/0701152
Microsoft SQL Server Analysis Services 2008 also ships with algorithms like these (http://technet.microsoft.com/en-us/library/ms175312.aspx) which help with predictive modelling of attributes.
cheers

Related

Kedro Data Modelling

We are struggling to model our data correctly for use in Kedro - we are using the recommended Raw\Int\Prm\Ft\Mst model but are struggling with some of the concepts....e.g.
When is a dataset a feature rather than a primary dataset? The distinction seems vague...
Is it OK for a primary dataset to consume data from another primary dataset?
Is it good practice to build a feature dataset from the INT layer? or should it always pass through Primary?
I appreciate there are no hard & fast rules with data modelling, but these are big modelling decisions and any guidance or best practice on Kedro modelling would be really helpful. I can find just one table defining the layers in the Kedro docs.
If anyone can offer any further advice or blogs\docs talking about Kedro Data Modelling that would be awesome!
Great question. As you say, there are no hard and fast rules here and opinions do vary, but let me share my perspective as a QB data scientist and kedro maintainer who has used the layering convention you referred to several times.
For a start, let me emphasise that there's absolutely no reason to stick to the data engineering convention suggested by kedro if it's not suitable for your needs. 99% of users don't change the folder structure in data. This is not because the kedro default is the right structure for them but because they just don't think of changing it. You should absolutely add/remove/rename layers to suit yourself. The most important thing is to choose a set of layers (or even a non-layered structure) that works for your project rather than trying to shoehorn your datasets to fit the kedro default suggestion.
Now, assuming you are following kedro's suggested structure - onto your questions:
When is a dataset a feature rather than a primary dataset? The distinction seems vague...
In the case of simple features, a feature dataset can be very similar to a primary one. The distinction is maybe clearest if you think about more complex features, e.g. ones formed by aggregating over time windows. A primary dataset would have a column that gives a cleaned version of the original data, but without doing any complex calculations on it, just simple transformations. Say the raw data is the colour of all cars driving past your house over a week. By the time the data is in primary, it will be clean (e.g. correcting "rde" to "red", maybe mapping "crimson" and "red" to the same colour). Between primary and the feature layer, we will have done some less trivial calculations on it, e.g. to find the one-hot encoded most common car colour for each day.
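To make that car-colour example a bit more concrete, here's a rough pandas sketch of the two steps (column names are made up; in a real project each function would be a node feeding the primary and feature layers respectively):

    import pandas as pd

    def clean_colours(raw: pd.DataFrame) -> pd.DataFrame:
        """Primary-style cleaning: fix typos, merge synonyms, no clever calculations."""
        fixes = {"rde": "red", "crimson": "red"}
        prm = raw.copy()
        prm["colour"] = prm["colour"].str.lower().replace(fixes)
        return prm

    def most_common_colour_per_day(prm: pd.DataFrame) -> pd.DataFrame:
        """Feature-style aggregation: a non-trivial calculation over a time window."""
        daily_mode = (
            prm.groupby(prm["timestamp"].dt.date)["colour"]
            .agg(lambda s: s.mode().iloc[0])
        )
        return pd.get_dummies(daily_mode)  # one-hot encoded most common colour per day

    raw = pd.DataFrame({
        "timestamp": pd.to_datetime(["2023-01-01 08:00", "2023-01-01 09:00", "2023-01-02 08:30"]),
        "colour": ["rde", "red", "crimson"],
    })
    features = most_common_colour_per_day(clean_colours(raw))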
Is it OK for a primary dataset to consume data from another primary dataset?
In my opinion, yes. This might be necessary if you want to join multiple primary tables together. In general if you are building complex pipelines it will become very difficult if you don't allow this. e.g. in the feature layer I might want to form a dataset containing composite_feature = feature_1 * feature_2 from the two inputs feature_1 and feature_2. There's no way of doing this without having multiple sub-layers within the feature layer.
However, something that is generally worth avoiding is a node that consumes data from many different layers. e.g. a node that takes in one dataset from the feature layer and one from the intermediate layer. This seems a bit strange (why has the latter dataset not passed through the feature layer?).
Is it good practice to build a feature dataset from the INT layer? or should it always pass through Primary?
Building features from the intermediate layer isn't unheard of, but it seems a bit weird. The primary layer is typically an important one which forms the basis for all feature engineering. If your data is in a shape where you can already build features from it, then it's probably at the primary layer already. In that case, maybe you don't need an intermediate layer.
The above points might be summarised by the following rules (which should no doubt be broken when required):
The input datasets for a node in layer L should all be in the same layer, which can be either L or L-1
The output datasets for a node in layer L should all be in the same layer, which can be either L or L+1
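A rough sketch of what those rules look like in a Kedro pipeline (the dataset names are invented, and the prm_/ft_ prefixes just stand in for the layer each dataset is registered under in the catalog):

    from kedro.pipeline import Pipeline, node

    def join_primaries(cars, owners):  # primary -> primary (same layer), fine
        return cars.merge(owners, on="owner_id")

    def build_features(prm):           # primary -> feature (L -> L+1)
        return prm.assign(price_per_hp=prm["cost"] / prm["horsepower"])

    pipeline = Pipeline([
        node(join_primaries, inputs=["prm_cars", "prm_owners"], outputs="prm_cars_owned"),
        node(build_features, inputs="prm_cars_owned", outputs="ft_car_features"),
    ])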
If anyone can offer any further advice or blogs\docs talking about Kedro Data Modelling that would be awesome!
I'm also interested in seeing what others think here! One possibly useful thing to note is that kedro was inspired by cookiecutter data science, and the kedro layer structure is an extended version of what's suggested there. Maybe other projects have taken this directory structure and adapted it in different ways.
Your question prompted us to write a Medium article better explaining these concepts; it's just been published on Towards Data Science.

Which open-source recommendation system should I choose to deal with a big dataset?

I want to build a recommendation system, and the target is to deal with a really big data set, like 1 TB of data.
Each user has a really huge number of items; however, the number of users is small, like thousands or tens of thousands.
I have searched on Google and found that there are some open-source recommendation engines based on Hadoop, like Mahout. I guess it may have the ability to deal with such big data, but I'm not sure.
I also found some engines written in C++, Python, even PHP. I don't think scripting languages can deal with such big data, because memory can't contain the whole dataset.
Or am I wrong? Could someone give me some recommendations?
Your question title is:
"Which open-source recommendation system should I choose to deal with a big dataset?"
and in the first line you say
"I want to build a recommendation system, and the target is to deal with a really big data set, like 1 TB of data."
And you are asking for a recommendation as an answer.
To answer your second question first. In my experience of building recommender systems I would advise you do not "build" a recommender system from the ground up if you can avoid it. Recommender Systems are complex and can use a wide range of techniques to provide a user with a recommendation. So my recommendation is unless you are really committed, and have a team of people with a range of experience and knowledge in recommender systems, statistics, and software engineering then look to implement an existing recommender system rather than building your own.
In terms of which open source recommender system you should choose, this is actually pretty difficult to answer with great accuracy. Let me try to answer this by breaking it down.
Consider the open source license, its restrictions and your requirements.
Consider which algorithm you want to use to make recommendations
Consider the environment you will be running your recommender system on.
I recommend you look more into the algorithm side as it will be the determining factor as to which tool you can use, or whether you will need to roll your own. Start reading here http://www.ibm.com/developerworks/library/os-recommender1/ for a very brief insight into the different approaches that recommender systems use. In summary, the different approaches are:
Content based
Neighbourhood / Collaborative filtering based
Constraint based
Graph-based
In your case to keep things relatively straightforward it sounds like you should consider a user-user collaborative filtering algorithm for this. The reasons being:
Neighbourhood Collaborative Filtering is quite intuitive to understand and it can be relatively easy to implement.
With this method you can also justify your recommendations to your users in a basic way
There is no requirement to build a model for training, and the processing of neighbours can be done "offline", to provide quick recommendations to the end user.
Storing neighbours is actually quite memory efficient, which means better scalability. Something it sounds like you will need lots of.
The user-based part of my suggestion is because it sounds like you have fewer users than you do items. In user-based nearest-neighbour collaborative filtering, a predicted rating of a new item I for user U is calculated by looking at the other users who have also rated item I and are most similar to user U. Because you have fewer users than items in your system, it will be faster to compute user-based collaborative filtering than item-based collaborative filtering.
Within user-based collaborative filtering you need to consider which rating normalisation (mean-centring vs z-score) you want to use, the similarity weight computation method (e.g. cosine vs Pearson correlation vs other similarity measures), the neighbourhood selection criteria (pre-filtering of neighbours, number of neighbours involved in the prediction), and any dimensionality reduction methods (SVD, SVD++) you want to implement (with a large dataset like yours you will want to seriously consider dimensionality reduction).
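To illustrate the user-based approach with mean-centring and cosine similarity, here's a tiny numpy sketch (the rating matrix is made up, and a 1 TB dataset would of course need a distributed implementation like Mahout rather than an in-memory one):

    import numpy as np

    # Rows = users, columns = items; 0 means "not rated".
    R = np.array([
        [5, 3, 0, 1],
        [4, 0, 0, 1],
        [1, 1, 0, 5],
        [1, 0, 0, 4],
    ], dtype=float)

    mask = R > 0
    user_means = R.sum(1) / np.maximum(mask.sum(1), 1)
    centred = np.where(mask, R - user_means[:, None], 0)  # mean-centring

    def cosine(u, v):
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        return u @ v / denom if denom else 0.0

    def predict(user, item, k=2):
        """Predict user's rating of item from the k most similar users who rated it."""
        raters = [u for u in range(R.shape[0]) if u != user and mask[u, item]]
        sims = sorted(((cosine(centred[user], centred[u]), u) for u in raters), reverse=True)[:k]
        num = sum(s * centred[u, item] for s, u in sims)
        den = sum(abs(s) for s, _ in sims)
        return user_means[user] + (num / den if den else 0.0)

    print(predict(user=1, item=1))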
So really, instead of looking for an open source tool that will be able to process your data set, you should consider your algorithm choice first, then look for a tool that has an implementation of this algorithm, and then assess whether it can process the volume involved in your dataset.
In saying all of that, if you do choose to go down the user-based collaborative filtering route then I am confident that Apache Mahout will be able to solve your problem, and if not it will certainly help you understand the complexity involved in building your own (just look at their source code).
Please note the advice is really to consider the algorithm choice first. "Good" recommender systems are so much more than just being able to process a large dataset. You need to think about accuracy, coverage, confidence, novelty, serendipity, diversity, robustness, privacy, risk, user trust, and finally scalability. You should also consider how you are going to perform experiments and evaluate your recommendations; remember, if the recommendations you are churning out are rubbish and it is turning your users off, then there is no point in having a recommender system!
It is such a big area with lots to think about, there is probably no one single tool that is going to help you with everything, so be prepared to do a lot of reading and research as well as implementing lots of different open source tools to help you.
In saying that, start looking at Apache Mahout. Going back to the breakdown of the 3 areas I said you should think about:
It has a commercial-friendly open-source license,
it has really great implementations of the algorithms you are likely going to need to use, and
it can work on distributed environments (read scalable).
Hope that helps, and good luck.

Data matching Algorithm Approach

I don't really know where to start with this project, and so I'm hoping a broad question can at least point me in the right direction.
I have 2 data sets right now, each about 5 GB with 2 million observations. They are the assessed and historical data gathered for property listings of a given area over a certain amount of time. What I need to do is match properties to one another. A property may appear in the historical data more than once, since it gets sold 2 or 3 times during the period. In the historical data I have the seller info, the loan info, and the sale info. In the assessor data I have all of the characteristics that describe the property sold. So in order to do any pricing model, I need to match the two.
I have variables that are similar in each, but they are going to differ slightly (misspellings, abbreviations, etc.). Does anyone have any recommendations for me about going through this? First off, what program would I want to do this in? I have experience in STATA, R and a little bit of SAS and Matlab, but I'd prefer to use the former two.
I read through this:
Data matching algorithm
There he uses .NET, and one user suggested a Levenshtein approach (where the distance between strings is calculated), so for fields like Address I could use this and weight the approximate match between the two strings. Then it was suggested to maybe use Soundex for the name of the seller/owner.
But I'm really lost in how to implement any of this, and before I approach anyone in my department I really need to have some sort of idea of what I'm doing!
Any help or advice would be immensely helpful.
Yes, there are several good algorithms for the string matching problem you describe, namely:
Jaro-Winkler,
Smith-Waterman,
Dice-Sørensen,
Soundex,
Damerau-Levenshtein, and
Monge-Elkan,
to name a few.
I recommend A Comparison of String Distance Metrics for Name-Matching Tasks, by W. W. Cohen, P. Ravikumar, and S. Fienberg, for an overview of which metric tends to work best for what.
SoftTFIDF claims to be the best one. It is available as a Java package. There are other implementations of string matching and record linkage algorithms available in:
Java (SecondString),
Python (JellyFish),
C# (FuzzyString), and
Scala (StringMetric)
libraries.
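Since you mentioned preferring R and scripting tools, here's a minimal Python sketch using the JellyFish package mentioned above (assuming a recent version of jellyfish; the field names, weights and threshold are invented and would need tuning against your actual data):

    import jellyfish

    def match_score(rec_a, rec_b):
        """Combine a fuzzy address score with a phonetic check on the seller surname."""
        addr_sim = jellyfish.jaro_winkler_similarity(rec_a["address"], rec_b["address"])
        name_match = jellyfish.soundex(rec_a["surname"]) == jellyfish.soundex(rec_b["surname"])
        return 0.7 * addr_sim + 0.3 * (1.0 if name_match else 0.0)

    assessed = {"address": "123 N Main Street", "surname": "Smith"}
    historical = {"address": "123 North Main St", "surname": "Smyth"}

    score = match_score(assessed, historical)
    print(score, "match" if score > 0.8 else "no match")

If you'd rather stay in R, the stringdist package covers the same family of distance metrics (Levenshtein, Jaro-Winkler, Soundex and so on).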

trouble with recurrent neural network algorithm for structured data classification

TL;DR
I need help understanding some parts of a specific algorithm for structured data classification. I'm also open to suggestions for different algorithms for this purpose.
Hi all!
I'm currently working on a system involving classification of structured data (I'd prefer not to reveal anything more about it) for which I'm using a simple backpropagation through structure (BPTS) algorithm. I'm planning on modifying the code to make use of a GPU for an additional speed boost later, but at the moment I'm looking for better algorithms than BPTS that I could use.
I recently stumbled on this paper -> [1] and I was amazed by the results. I decided to give it a try, but I have some trouble understanding some parts of the algorithm, as its description is not very clear. I've already emailed some of the authors requesting clarification, but haven't heard from them yet, so, I'd really appreciate any insight you guys may have to offer.
The high-level description of the algorithm can be found in page 787. There, in Step 1, the authors randomize the network weights and also "Propagate the input attributes of each node through the data structure from frontier nodes to root forwardly and, hence, obtain the output of root node". My understanding is that Step 1 is never repeated, since it's the initialization step. The part I quote indicates that a one-time activation also takes place here. But, what item in the training dataset is used for this activation of the network? And is this activation really supposed to happen only once? For example, in the BPTS algorithm I'm using, for each item in the training dataset, a new neural network - whose topology depends on the current item (data structure) - is created on the fly and activated. Then, the error backpropagates, the weights are updated and saved, and the temporary neural network is destroyed.
Another thing that troubles me is Step 3b. There, the authors mention that they update the parameters {A, B, C, D} NT times, using equations (17), (30) and (34). My understanding is that NT denotes the number of items in the training dataset. But equations (17), (30) and (34) already involve ALL items in the training dataset, so, what's the point of solving them (specifically) NT times?
Yet another thing I failed to get is how exactly their algorithm takes into account the (possibly) different structure of each item in the training dataset. I know how this works in BPTS (I described it above), but it's very unclear to me how it works with their algorithm.
Okay, that's all for now. If anyone has any idea of what might be going on with this algorithm, I'd be very interested in hearing it (or rather, reading it). Also, if you are aware of other promising algorithms and / or network architectures (could long short term memory (LSTM) be of use here?) for structured data classification, please don't hesitate to post them.
Thanks in advance for any useful input!
[1] http://www.eie.polyu.edu.hk/~wcsiu/paper_store/Journal/2003/2003_J4-IEEETrans-ChoChiSiu&Tsoi.pdf

How to get scientific results from non-experimental data (datamining?)

I want to obtain maximum performance out of a process with many variables, many of which cannot be controlled.
I cannot run thousands of experiments, so it'd be nice if I could run hundreds of experiments and
vary many controllable parameters
collect data on many parameters indicating performance
'correct,' as much as possible, for those parameters I couldn't control
Tease out the 'best' values for those things I can control, and start all over again
It feels like this would be called data mining, where you're going through tons of data which doesn't immediately appear to relate, but does show correlation after some effort.
So... Where do I start looking at algorithms, concepts, theory of this sort of thing? Even related terms for purposes of search would be useful.
Background: I like to do ultra-marathon cycling, and keep logs of each ride. I'd like to keep more data, and after hundreds of rides be able to pull out information about how I perform.
However, everything varies - routes, environment (temp, pres., hum., sun load, wind, precip., etc.), fuel, attitude, weight, water load, etc., etc. I can control a few things, but running the same route 20 times to test out a new fuel regime would just be depressing, and it would take years to perform all the experiments that I'd like to do. I can, however, record all these things and more (telemetry on bicycle FTW).
It sounds like you want to do some regression analysis. You certainly have plenty of data!
Regression analysis is an extremely common modeling technique in statistics and science. (It could be argued that statistics is the art and science of regression analysis.) There are many statistics packages out there to do the computation you'll need. (I'd recommend one, but I'm years out of date.)
Data mining has gotten a bad name because far too often people assume correlation equals causation. I've found that a good technique is to start with variables you know have an influence and build a statistical model around them first. You know that wind, weight and climb have an influence on how fast you can travel, and statistical software can take your dataset and calculate what the correlations between those factors are. That will give you a statistical model, or linear equation:
speed = x*weight + y*wind + z*climb + constant
When you explore new variables, you will be able to see if the model is improved or not by comparing a goodness of fit metric like R-squared. So you might check if temperature or time of day adds anything to the model.
You may want to apply a transformation to your data. For instance, you might find that you perform better on colder days, but really cold days and really hot days might both hurt performance. In that case, you could assign temperatures to bins or segments: < 0°C, 0°C to 40°C, > 40°C, or some such. The key is to transform the data in a way that matches a rational model of what is going on in the real world, not just the data itself.
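Here's a small sketch of that workflow in Python with statsmodels (the ride log is invented; note that a fairer comparison between models with different numbers of variables would use adjusted R-squared or AIC, which statsmodels also reports):

    import pandas as pd
    import statsmodels.api as sm

    rides = pd.DataFrame({
        "speed":  [28.1, 25.3, 30.2, 22.8, 26.5, 27.9, 21.5],
        "weight": [82, 84, 81, 85, 83, 82, 86],
        "wind":   [5, 12, 2, 18, 8, 6, 10],
        "climb":  [400, 900, 150, 1400, 700, 500, 800],
        "temp_c": [18, 31, 12, 38, 22, 25, -2],
    })

    # Baseline model: speed = x*weight + y*wind + z*climb + constant
    X = sm.add_constant(rides[["weight", "wind", "climb"]])
    baseline = sm.OLS(rides["speed"], X).fit()

    # Transform temperature into bins rather than using it raw, since both
    # very cold and very hot days may hurt performance.
    rides["temp_bin"] = pd.cut(rides["temp_c"], bins=[-50, 0, 30, 60],
                               labels=["cold", "mild", "hot"])
    dummies = pd.get_dummies(rides["temp_bin"], drop_first=True, dtype=float)
    X2 = sm.add_constant(pd.concat([rides[["weight", "wind", "climb"]], dummies], axis=1))
    extended = sm.OLS(rides["speed"], X2).fit()

    # Did the new variable earn its keep? Compare goodness of fit.
    print(baseline.rsquared, extended.rsquared)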
In case someone thinks this is not a programming related topic, notice that you can use these same techniques to analyze system performance.
With that many variables you have too many dimensions, and you may want to look at Principal Component Analysis. It takes some of the "art" out of regression analysis and lets the data speak for itself. Some software to do that sort of analysis is listed at the bottom of the link.
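A quick sketch of what a first PCA pass looks like with scikit-learn (the ride log here is just random numbers standing in for your telemetry columns):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # Hypothetical ride log: many correlated columns (temps, wind, humidity, ...).
    rng = np.random.default_rng(0)
    rides = rng.normal(size=(200, 12))              # 200 rides, 12 logged variables

    scaled = StandardScaler().fit_transform(rides)  # PCA is scale-sensitive
    pca = PCA().fit(scaled)

    # How much of the total variance do the first few components explain?
    print(np.cumsum(pca.explained_variance_ratio_)[:4])

    # Project the rides onto the top 3 components and regress on those instead.
    reduced = PCA(n_components=3).fit_transform(scaled)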
I have used the Perl module Statistics::Regression for somewhat similar problems in the past. Be warned, however, that regression analysis is definitely an art. As the warning in the Perl module says, it won't make sense to you if you haven't learned the appropriate math.
