Weka - How can I improve J48 performance?

I'm working on a data mining project where I need to predict the chances of success of a Kickstarter project's funding.
I've used a Kickstarter dataset that I found on Kaggle. I've cleaned the noisy data, deleted irrelevant attributes and added other useful attributes.
Now I have about 320K instances and 6 attributes.
After running the J48 algorithm, I'm getting 65.07% correctly classified instances and an average ROC area of 68.7%.
I have to improve this performance, but I don't know how.
It's a college project, so I have specific rules: I can only change the confidence factor and minNumObj of the algorithm.
I've spent a lot of time trying every combination.
What else can I do? Maybe something in my dataset is problematic?

You have a lot of instances but few attributes. If you can't add more attributes, you have probably already reached the best result you can get with J48 trees, and feature selection won't help. You will probably need a more complex classification algorithm, such as RandomForest.

What type of algorithm should I use for forecasting with only very little historic data?

The problem is as follows:
I want to use a forecasting algorithm to predict the heat demand of an unspecified household over the next 24 hours, with a time resolution of only a few minutes for the next three or four hours and a lower resolution for the hours after that.
The algorithm should be adaptive and learn over time. I do not have much historic data, because I want the algorithm to be usable in different settings from the start. To begin with, I only have very basic inputs such as the assumed yearly heat demand, the current outside temperature and the time. So it will be quite general and imprecise at the beginning but learn from its errors over time.
The algorithm should be implemented in MATLAB if possible.
Does anyone know an approach or an algorithm designed to predict sensible values after a short time by learning and adapting to incoming data?

Well, this question is quite broad, as essentially any algorithm for forecasting or data assimilation could do this task in principle.
The classic approach I would look into first is Kalman filtering, which is quite general, at least once its generalizations to ensemble filters etc. are taken into account. (It is also easy to implement in MATLAB.)
https://en.wikipedia.org/wiki/Kalman_filter
However, the design of the model you fit to your data is typically more important than the actual inference algorithm. For your scenario you could start with a simple prediction from past values and add daily rhythms, influences of outside temperature etc. The more (correct) information you put into your model a priori, the better your model should be at prediction.
For the full mathematical analysis of this type of problem I can recommend this book: https://doi.org/10.1017/CBO9781107706804
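As a rough illustration of the Kalman filtering idea mentioned above, here is a minimal scalar sketch in Python (not MATLAB). It assumes the heat demand follows a random walk and is observed with noise. The function name kalman_1d, the noise variances q and r, and the readings are all made up for illustration; none of them come from the question.

```python
def kalman_1d(observations, x0, p0, q, r):
    """Scalar Kalman filter for a random-walk state observed with noise.

    x0, p0: initial state estimate and its variance (assumed values)
    q: process noise variance, r: observation noise variance (assumed values)
    """
    x, p = x0, p0
    estimates = []
    for z in observations:
        # Predict: a random walk keeps the mean, the uncertainty grows by q.
        p = p + q
        # Update: blend prediction and observation using the Kalman gain.
        k = p / (p + r)
        x = x + k * (z - x)
        p = (1.0 - k) * p
        estimates.append(x)
    return estimates


# Hypothetical heat-demand readings in kW, arriving one at a time.
readings = [2.1, 2.4, 2.2, 2.8, 3.0, 2.9]
estimates = kalman_1d(readings, x0=2.0, p0=1.0, q=0.01, r=0.25)
```

A larger q makes the filter follow new readings more quickly, while a larger r makes it smoother. Daily rhythms or the outside temperature would go into the prediction step of a richer model.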
In order to turn this into a calibration problem, we need:
a model that predicts the heat demand depending on inputs and parameters,
observations of the heat demand.
Calibrating this model means tuning the parameters so that the model best predicts the heat demand.
If you go for Python, I suggest using OpenTURNS, which provides several data assimilation methods, e.g. Kalman filtering (also called BLUE):
https://openturns.github.io/openturns/latest/user_manual/calibration.html
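To make the calibration idea concrete, here is a small sketch in plain Python. It does not use the OpenTURNS API. It assumes a hypothetical degree-day style model, demand = a * max(0, base_temp - T_out) + b, and tunes a and b by ordinary least squares. The model form, the base temperature of 18 °C and the data are assumptions for illustration only.

```python
def calibrate_heat_model(outside_temps, demands, base_temp=18.0):
    """Fit demand = a * max(0, base_temp - T_out) + b by ordinary least squares."""
    # Model input: how far the outside temperature is below the base temperature.
    x = [max(0.0, base_temp - t) for t in outside_temps]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(demands) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0.0:
        raise ValueError("temperatures give no variation in the model input")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, demands))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Hypothetical observations: outside temperature (deg C) and heat demand (kW).
outside_temps = [-5.0, 0.0, 5.0, 10.0, 15.0]
demands = [5.1, 4.1, 3.1, 2.1, 1.1]

a, b = calibrate_heat_model(outside_temps, demands)
# Prediction of the calibrated model for an outside temperature of 2 deg C.
predicted = a * max(0.0, 18.0 - 2.0) + b
```

Refitting whenever new observations arrive is a crude way of adapting over time. Recursive methods such as the Kalman filter do the same thing incrementally.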

trouble with recurrent neural network algorithm for structured data classification

TL;DR
I need help understanding some parts of a specific algorithm for structured data classification. I'm also open to suggestions for different algorithms for this purpose.
Hi all!
I'm currently working on a system involving classification of structured data (I'd prefer not to reveal anything more about it) for which I'm using a simple backpropagation through structure (BPTS) algorithm. I'm planning on modifying the code to make use of a GPU for an additional speed boost later, but at the moment I'm looking for better algorithms than BPTS that I could use.
I recently stumbled on this paper -> [1] and I was amazed by the results. I decided to give it a try, but I have some trouble understanding some parts of the algorithm, as its description is not very clear. I've already emailed some of the authors requesting clarification, but haven't heard from them yet, so, I'd really appreciate any insight you guys may have to offer.
The high-level description of the algorithm can be found on page 787. There, in Step 1, the authors randomize the network weights and also "Propagate the input attributes of each node through the data structure from frontier nodes to root forwardly and, hence, obtain the output of root node". My understanding is that Step 1 is never repeated, since it's the initialization step. The part I quote indicates that a one-time activation also takes place here. But, what item in the training dataset is used for this activation of the network? And is this activation really supposed to happen only once? For example, in the BPTS algorithm I'm using, for each item in the training dataset, a new neural network - whose topology depends on the current item (data structure) - is created on the fly and activated. Then, the error backpropagates, the weights are updated and saved, and the temporary neural network is destroyed.
Another thing that troubles me is Step 3b. There, the authors mention that they update the parameters {A, B, C, D} NT times, using equations (17), (30) and (34). My understanding is that NT denotes the number of items in the training dataset. But equations (17), (30) and (34) already involve ALL items in the training dataset, so, what's the point of solving them (specifically) NT times?
Yet another thing I failed to get is how exactly their algorithm takes into account the (possibly) different structure of each item in the training dataset. I know how this works in BPTS (I described it above), but it's very unclear to me how it works with their algorithm.
Okay, that's all for now. If anyone has any idea of what might be going on with this algorithm, I'd be very interested in hearing it (or rather, reading it). Also, if you are aware of other promising algorithms and / or network architectures (could long short term memory (LSTM) be of use here?) for structured data classification, please don't hesitate to post them.
Thanks in advance for any useful input!
[1] http://www.eie.polyu.edu.hk/~wcsiu/paper_store/Journal/2003/2003_J4-IEEETrans-ChoChiSiu&Tsoi.pdf

algorithm to combine data for linear fit?

I'm not sure if this is the best place to ask this, but you guys have been helpful with plenty of my CS homework in the past so I figure I'll give it a shot.
I'm looking for an algorithm to blindly combine several dependent variables into an index that produces the best linear fit with an external variable. Basically, it would combine the dependent variables using different mathematical operators, include or not include each one, etc. until an index is developed that best correlates with my external variable.
Has anyone seen/heard of something like this before? Even if you could point me in the right direction or to the right place to ask, I would appreciate it. Thanks.

Sounds like you're trying to do Multivariate Linear Regression or Multiple Regression. The simplest method (read: less accurate) to do this is to individually compute the linear regression lines of each of the component variables and then do a weighted average of each of the lines. Beyond that I am afraid I will be of little help.

This appears to be linear regression using multiple explanatory variables. As the implication here is that you are using a computational approach, you could do something as simple as fitting a linear model to your data for every possible combination of your explanatory variables (whether you want to include interaction effects is your choice), choosing a goodness of fit measure (R^2 being just one example) and using that to rank the models. The quality of a model is also somewhat subjective in many fields - you could reject a model containing 15 variables if it only moderately improves the fit over a far simpler model containing just 3 variables. If you have not read it already, I don't doubt that you will find many useful suggestions in the following text:
Draper, N.R. and Smith, H. (1998). Applied Regression Analysis. Wiley Series in Probability and Statistics.
You might also try doing a google for the LASSO method of model selection.
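The "fit every combination and rank by goodness of fit" idea above can be sketched in plain Python as follows. This is only an illustration under assumptions: the helper names (solve, fit_r2, rank_subsets), the variable names a, b, c and the data are made up, and the least-squares fit uses the normal equations, which is fine for a toy example but not how a statistics package would do it.

```python
from itertools import combinations


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            # Divides by zero if the explanatory variables are perfectly collinear.
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_r2(rows, y):
    """Least-squares fit with intercept; returns (coefficients, R^2)."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    xtx = [[sum(xi[i] * xi[j] for xi in X) for j in range(k)] for i in range(k)]
    xty = [sum(xi[i] * yi for xi, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    pred = [sum(bj * v for bj, v in zip(beta, xi)) for xi in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return beta, 1.0 - ss_res / ss_tot


def rank_subsets(columns, y):
    """Fit every non-empty subset of the named columns; best R^2 first."""
    names = list(columns)
    results = []
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            rows = list(zip(*(columns[name] for name in subset)))
            _, r2 = fit_r2(rows, y)
            results.append((r2, subset))
    return sorted(results, reverse=True)


# Made-up data: y was constructed as 2*a + 0.5*b + 1, c is unrelated.
columns = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
    "c": [5.0, 3.0, 6.0, 1.0, 2.0, 4.0],
}
y = [4.0, 5.5, 9.0, 10.5, 14.0, 15.5]

ranked = rank_subsets(columns, y)
```

Note that plain R^2 never decreases when a variable is added, so for the "3 variables versus 15 variables" trade-off you would normally rank by a penalized measure such as adjusted R^2 instead. The number of subsets also doubles with every extra variable, which is one reason methods like the LASSO exist.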
The thing you're asking for is essentially the entirety of regression analysis.
This is what linear regression does, and this is a good portion of what "machine learning" does (machine learning is basically just a name for more complicated regression and classification algorithms). There are hundreds or thousands of different approaches with various tradeoffs, but the basic ones frequently work quite well.
If you want to learn more, the Coursera course on machine learning is a great place to get a deeper understanding of this.

Predicting missing data values in a database

I have a database, consisting of a whole bunch of records (around 600,000) where some of the records have certain fields missing. My goal is to find a way to predict what the missing data values should be (so I can fill them in) based on the existing data.
One option I am looking at is clustering - i.e. representing the records that are all complete as points in some space, looking for clusters of points, and then, when given a record with missing data values, trying to find out whether there are any clusters it could belong to that are consistent with the existing data values. However, this may not be possible because some of the data fields are on a nominal scale (e.g. color) and thus can't be put in order.
Another idea I had is to create some sort of probabilistic model that would predict the data, train it on the existing data, and then use it to extrapolate.
What algorithms are available for doing the above, and is there any freely available software that implements those algorithms? (The software is going to be written in C#, by the way.)

This is less of an algorithmic and more of a philosophical and methodological question. There are a few different techniques available to tackle this kind of question. Acock (2005) gives a good introduction to some of the methods. Although it may seem that there is a lot of math/statistics involved (and may seem like a lot of effort), it's worth thinking about what would happen if you messed up.
Andrew Gelman's blog is also a good resource, although the search functionality on his blog leaves something to be desired...
Hope this helps.
Acock (2005)
http://oregonstate.edu/~acock/growth-curves/working%20with%20missing%20values.pdf
Andrew Gelman's blog
http://www.stat.columbia.edu/~cook/movabletype/mlm/

Dealing with missing values is a methodological question that has to do with the actual meaning of the data.
There are several methods you can use (detailed post on my blog):
1. Ignore the data row. This is usually done when the class label is missing (assuming your data mining goal is classification), or many attributes are missing from the row (not just one). However, you'll obviously get poor performance if the percentage of such rows is high.
2. Use a global constant to fill in for missing values, like "unknown", "N/A" or minus infinity. This is used because sometimes it just doesn't make sense to try and predict the missing value. For example, if you have a DB of, say, college candidates and state of residence is missing for some, filling it in doesn't make much sense...
3. Use the attribute mean. For example, if the average income of a US family is X, you can use that value to replace missing income values.
4. Use the attribute mean for all samples belonging to the same class. Let's say you have a car pricing DB that, among other things, classifies cars as "Luxury" and "Low budget", and you're dealing with missing values in the cost field. Replacing the missing cost of a luxury car with the average cost of all luxury cars is probably more accurate than the value you'd get if you factored in the low budget cars.
5. Use a data mining algorithm to predict the value. The value can be determined using regression, inference-based tools using Bayesian formalism, decision trees, or clustering algorithms (k-means, k-medians, etc.), which can be used to generate the classes for method 4.
I'd suggest looking into regression and decision trees first (ID3 tree generation) as they're relatively easy and there are plenty of examples on the net.
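To show what the attribute-mean and class-mean strategies from the list above look like in practice, here is a minimal sketch in Python (not C#, but it should translate easily). The function names, the field names and the car records are made up for illustration, and missing values are assumed to be stored as None.

```python
from collections import defaultdict


def impute_mean(records, field):
    """Fill missing (None) values of `field` with the mean of the observed values."""
    observed = [r[field] for r in records if r[field] is not None]
    mean = sum(observed) / len(observed)
    return [{**r, field: mean} if r[field] is None else dict(r) for r in records]


def impute_class_mean(records, field, class_field):
    """Fill missing values of `field` with the mean over records of the same class."""
    groups = defaultdict(list)
    for r in records:
        if r[field] is not None:
            groups[r[class_field]].append(r[field])
    means = {label: sum(values) / len(values) for label, values in groups.items()}
    filled = []
    for r in records:
        if r[field] is None:
            # Raises KeyError if a class has no observed value at all.
            filled.append({**r, field: means[r[class_field]]})
        else:
            filled.append(dict(r))
    return filled


# Made-up car pricing records; None marks a missing cost.
cars = [
    {"segment": "Luxury", "cost": 80000},
    {"segment": "Luxury", "cost": 100000},
    {"segment": "Luxury", "cost": None},
    {"segment": "Low budget", "cost": 10000},
    {"segment": "Low budget", "cost": 14000},
    {"segment": "Low budget", "cost": None},
]

by_mean = impute_mean(cars, "cost")
by_class = impute_class_mean(cars, "cost", "segment")
```

With this data the overall mean gives both missing cars the same cost, while the class mean gives the luxury car a much higher value than the low budget one, which is the point of the class-based method.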
As for packages, if you can afford it and you're in the Microsoft world, look at SQL Server Analysis Services (SSAS for short), which implements most of the methods mentioned above.
Here are some links to free data mining software packages:
WEKA - http://www.cs.waikato.ac.nz/ml/weka/index.html
ORANGE - http://www.ailab.si/orange
TANAGRA - http://eric.univ-lyon2.fr/~ricco/tanagra/en/tanagra.html
Although not C#, here's a pretty good intro to decision trees and Bayesian learning (using Ruby):
http://www.igvita.com/2007/04/16/decision-tree-learning-in-ruby/
http://www.igvita.com/2007/05/23/bayes-classification-in-ruby/
There's also this Ruby library that I find very useful (also for learning purposes):
http://ai4r.rubyforge.org/machineLearning.html
There should be plenty of samples for these algorithms online in any language so I'm sure you'll easily find C# stuff too...
Edited:
Forgot this in my original post. This is definitely a MUST HAVE if you're playing with data mining...
Download Microsoft SQL Server 2008 Data Mining Add-ins for Microsoft Office 2007 (It requires SQL Server Analysis Services - SSAS - which isn't free but you can download a trial).
This will allow you to easily play and try out the different techniques in Excel before you go and implement this stuff yourself. Then again, since you're in the Microsoft ecosystem, you might even decide to go for an SSAS based solution and count on the SQL Server guys to do it for ya :)

Predicting missing values is generally considered to be part of the data cleansing phase, which needs to be done before the data is mined or analyzed further. Missing values are very common in real-world data.
Please have a look at this algorithm http://arxiv.org/abs/math/0701152
Currently Microsoft SQL Server Analysis Services 2008 also comes with algorithms like these http://technet.microsoft.com/en-us/library/ms175312.aspx which help in predictive modelling of attributes.
cheers

How to get scientific results from non-experimental data (datamining?)

I want to obtain maximum performance out of a process with many variables, many of which cannot be controlled.
I cannot run thousands of experiments, so it'd be nice if I could run hundreds of experiments and
vary many controllable parameters
collect data on many parameters indicating performance
'correct,' as much as possible, for those parameters I couldn't control
Tease out the 'best' values for those things I can control, and start all over again
It feels like this would be called data mining, where you're going through tons of data which doesn't immediately appear to relate, but does show correlation after some effort.
So... Where do I start looking at algorithms, concepts, theory of this sort of thing? Even related terms for purposes of search would be useful.
Background: I like to do ultra-marathon cycling, and keep logs of each ride. I'd like to keep more data, and after hundreds of rides be able to pull out information about how I perform.
However, everything varies - routes, environment (temp, pres., hum., sun load, wind, precip., etc), fuel, attitude, weight, water load, etc, etc, etc. I can control a few things, but running the same route 20 times to test out a new fuel regime would just be depressing, and it would take years to perform all the experiments I'd like to do. I can, however, record all these things and more (telemetry on the bicycle FTW).

It sounds like you want to do some regression analysis. You certainly have plenty of data!
Regression analysis is an extremely common modeling technique in statistics and science. (It could be argued that statistics is the art and science of regression analysis.) There are many statistics packages out there to do the computation you'll need. (I'd recommend one, but I'm years out of date.)

Data mining has gotten a bad name because far too often people assume correlation equals causation. I have found that a good technique is to start with variables you know have an influence and build a statistical model around them first. For example, you know that wind, weight and climb influence how fast you can travel, and statistical software can take your dataset and estimate how strongly each of those factors is related to speed. That will give you a statistical model or linear equation:
speed = x*weight + y*wind + z*climb + constant
When you explore new variables, you will be able to see if the model is improved or not by comparing a goodness of fit metric like R-squared. So you might check if temperature or time of day adds anything to the model.
You may want to apply a transformation to your data. For instance, you might find that you perform better on colder days. But really cold days and really hot days might hurt performance. In that case, you could assign temperatures to bins or segments: < 0°C; 0°C to 40°C; > 40°C, or some such. The key is to transform the data in a way that matches a rational model of what is going on in the real world, not just the data itself.
In case someone thinks this is not a programming related topic, notice that you can use these same techniques to analyze system performance.
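The temperature binning described above could look roughly like this in Python. The bin names, the function names and the choice of "moderate" as the baseline are assumptions for illustration; the cut-offs are the 0°C and 40°C values from the example.

```python
def temperature_bin(temp_c, low=0.0, high=40.0):
    """Assign a temperature in deg C to one of three bins."""
    if temp_c < low:
        return "cold"
    if temp_c > high:
        return "hot"
    return "moderate"


def one_hot(label, levels=("cold", "moderate", "hot"), baseline="moderate"):
    """Dummy-code a bin label as 0/1 columns, leaving out the baseline level.

    Dropping one level keeps the columns from being collinear with the
    intercept (the constant) of the regression model.
    """
    return [1.0 if label == level else 0.0 for level in levels if level != baseline]


# Hypothetical ride temperatures turned into regression inputs.
ride_temps = [-5.0, 12.0, 45.0]
bins = [temperature_bin(t) for t in ride_temps]
encoded = [one_hot(b) for b in bins]
```

The two resulting columns can then be added to the linear model next to weight, wind and climb. Their coefficients estimate how much speed changes on cold or hot days relative to moderate ones.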
With that many variables you have too many dimensions and you may want to look at Principal Component Analysis. It takes some of the "art" out of regression analysis and lets the data speak for itself. Some software to do that sort of analysis is shown at the bottom of the link.
I have used the Perl module Statistics::Regression for somewhat similar problems in the past. Be warned, however, that regression analysis is definitely an art. As the warning in the Perl module says, it won't make sense to you if you haven't learned the appropriate math.
