I want to use data prediction algorithms on network data. Can anyone point me in the right direction, please?
Which algorithm is most effective, and how do I apply the data to those formulas?
You will have to be more specific if you want a more detailed answer. Until then, here's something to help you get started.
When you say network data, the first thing I think about is streams of data packets, streams of transactions, and such. These usually arrive at great speed (e.g. thousands of elements per minute or more), are potentially infinite (all the elements won't fit into working memory; the best you can hope for is to store a very small fraction of them), and usually you want your prediction model to adapt to recent examples while using only a small amount of memory.
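To make the "store a very small fraction of elements" point concrete, here is a minimal sketch of reservoir sampling, one common way to keep a fixed-size uniform sample of a stream whose length is not known in advance. The function name and the synthetic stream of packet sizes are my own assumptions for illustration, not something taken from MOA or the books mentioned in this answer.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1),
            # overwriting a uniformly chosen stored item.
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

# Hypothetical stream: packet sizes arriving one at a time.
packets = (size for size in range(100_000))
kept = reservoir_sample(packets, k=100, rng=random.Random(42))
print(len(kept))
```

Memory use stays at k items no matter how long the stream runs, which is the property you need when the full data cannot be stored.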
It is tempting to say that we could learn a model on a sample of the data. However, the model you learn today may no longer be valid next week; you want to be able to detect these changes and adapt your model continuously.
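As a rough illustration of continuous adaptation, the sketch below keeps a running estimate that forgets old observations exponentially, so it follows the stream when the underlying value shifts. The class name, the forgetting rate `alpha`, and the synthetic values are assumptions chosen for the example; real stream learners use more elaborate change detectors.

```python
class ForgettingMean:
    """Running estimate that weights recent observations more heavily."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha      # assumed forgetting rate, 0 < alpha <= 1
        self.estimate = None

    def update(self, x):
        if self.estimate is None:
            self.estimate = float(x)
        else:
            # Move the estimate a fraction alpha towards the new observation.
            self.estimate += self.alpha * (x - self.estimate)
        return self.estimate

model = ForgettingMean(alpha=0.1)
for x in [10] * 100:          # the stream behaves one way for a while...
    model.update(x)
before_shift = model.estimate
for x in [50] * 100:          # ...and then the underlying value changes
    model.update(x)
after_shift = model.estimate
print(before_shift, after_shift)
```

A model trained once on the first 100 values would keep predicting about 10 forever; the forgetting estimate ends up close to the new level using constant memory.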
There is a whole branch of data mining dedicated to scenarios of this type; it is called data stream mining.
Depending on what you want to predict, I suggest that you consult Data Stream Mining: A Practical Approach by the team that develops MOA (which stands for Massive Online Analysis and is, as far as I know, the leading data stream mining toolbox); I also recommend the chapter on data stream mining from the Mining of Massive Datasets book by Leskovec et al. The Wikipedia article on data stream mining is a good place to look for further references.
As Wikipedia states:
The overall goal of the data mining process is to extract information
from a data set and transform it into an understandable structure for
further use
How is this related with Big Data? Is it correct if I say that Hadoop is doing data mining in a parallel manner?
Big data is everything
Big data is a marketing term, not a technical term. Everything is big data these days. My USB stick is a "personal cloud" now, and my harddrive is big data. Seriously. This is a totally unspecific term that is largely defined by what the marketing departments of various very optimistic companies can sell - and the C*Os of major companies buy, in order to make magic happen. Update: and by now, the same applies to data science. It's just marketing.
Data mining is the old big data
Actually, data mining was just as overused... it could mean anything such as
collecting data (think NSA)
storing data
machine learning / AI (which predates the term data mining)
non-ML data mining (as in "knowledge discovery", where the term data mining was actually coined; but where the focus is on new knowledge, not on learning of existing knowledge)
business rules and analytics
visualization
anything involving data you want to sell for truckloads of money
It's just that marketing needed a new term. "Business intelligence", "business analytics", ... they still keep on selling the same stuff, it's just rebranded as "big data" now.
Most "big" data mining isn't big
Since most methods - at least those that give interesting results - just don't scale, most data "mined" isn't actually big. It's clearly much bigger than 10 years ago, but not big as in exabytes. A survey by KDnuggets had something like 1-10 GB being the average "largest data set analyzed". That is not big data by any data management means; it's only large by what can be analyzed using complex methods. (I'm not talking about trivial algorithms such as k-means.)
Most "big data" isn't data mining
Now "Big data" is real. Google has big data, and CERN also has big data. Most others probably don't. Data starts being big when you need 1000 computers just to store it.
Big data technologies such as Hadoop are also real. They aren't always used sensibly (don't bother running Hadoop clusters with fewer than 100 nodes - at this point you can probably get much better performance from well-chosen non-clustered machines), but of course people write such software.
But most of what is being done isn't data mining. It's Extract, Transform, Load (ETL), so it is replacing data warehousing. Instead of using a database with structure, indexes and accelerated queries, the data is just dumped into Hadoop, and when you have figured out what to do, you re-read all your data, extract the information you really need, transform it, and load it into your Excel spreadsheet. Because after selection, extraction and transformation, usually it's not "big" anymore.
Data quality suffers with size
Many of the marketing promises of big data will not hold. Twitter produces far fewer insights for most companies than advertised (unless you are a teen rock star, that is); and the Twitter user base is heavily biased. Correcting for such a bias is hard, and needs highly experienced statisticians.
Bias in the data is one problem - if you just collect some random data from the internet or an application, it will usually not be representative; in particular not of potential users. Instead, you will be overfitting to the existing heavy users if you don't manage to cancel out these effects.
The other big problem is just noise. You have spam bots, but also other tools (think Twitter "trending topics" that cause reinforcement of "trends") that make the data much noisier than other sources. Cleaning this data is hard, and not a matter of technology but of statistical domain expertise. For example, Google Flu Trends was repeatedly found to be rather inaccurate. It worked in some of the earlier years (maybe because of overfitting?) but is no longer of good quality.
Unfortunately, a lot of big data users pay too little attention to this; which is probably one of the many reasons why most big data projects seem to fail (the others being incompetent management, inflated and unrealistic expectations, and lack of company culture and skilled people).
Hadoop != data mining
Now for the second part of your question. Hadoop doesn't do data mining. Hadoop manages data storage (via HDFS, a very primitive kind of distributed database) and it schedules computation tasks, allowing you to run the computation on the same machines that store the data. It does not do any complex analysis.
There are some tools that try to bring data mining to Hadoop. In particular, Apache Mahout can be called the official Apache attempt to do data mining on Hadoop. Except that it is mostly a machine learning tool (machine learning != data mining; data mining sometimes uses methods from machine learning). Some parts of Mahout (such as clustering) are far from advanced. The problem is that Hadoop is good for linear problems, but most data mining isn't linear. And non-linear algorithms don't just scale up to large data; you need to carefully develop linear-time approximations and live with losses in accuracy - losses that must be smaller than what you would lose by simply working on smaller data.
A good example of this trade-off problem is k-means. K-means actually is a (mostly) linear problem; so it can be somewhat run on Hadoop. A single iteration is linear, and if you had a good implementation, it would scale well to big data. However, the number of iterations until convergence also grows with data set size, and thus it isn't really linear. However, as this is a statistical method to find "means", the results actually do not improve much with data set size. So while you can run k-means on big data, it does not make a whole lot of sense - you could just take a sample of your data, run a highly-efficient single-node version of k-means, and the results will be just as good. Because the extra data just gives you some extra digits of precision of a value that you do not need to be that precise.
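The sampling argument can be checked on a toy case. The sketch below runs a plain two-cluster Lloyd iteration on one-dimensional synthetic data, once on the full data set and once on a small random sample, and prints both sets of centres. The data, the fixed k of 2, and the helper name are all assumptions for illustration; it says nothing about how any Hadoop implementation works.

```python
import random

def kmeans_1d(points, iterations=20):
    """Plain two-cluster Lloyd iteration on 1-D data (k fixed at 2 for brevity)."""
    centers = [min(points), max(points)]  # simple deterministic initialisation
    for _ in range(iterations):
        groups = [[], []]
        for p in points:
            # Assign each point to the nearer centre.
            nearest = 0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1
            groups[nearest].append(p)
        # Recompute each centre as the mean of its group.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return sorted(centers)

rng = random.Random(0)
# Synthetic data (an assumption): two Gaussian blobs around 0 and 10.
full = [rng.gauss(0, 1) for _ in range(10_000)] + [rng.gauss(10, 1) for _ in range(10_000)]
sample = rng.sample(full, 500)

full_centers = kmeans_1d(full)
sample_centers = kmeans_1d(sample)
print(full_centers, sample_centers)
```

On well-separated data like this, the centres found from 500 points differ from those found from 20,000 points only in the later digits, which is the "extra digits of precision" effect described above.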
Since this applies to quite a lot of problems, actual data mining on Hadoop doesn't seem to kick off. Everybody tries to do it, and a lot of companies sell this stuff. But it doesn't really work much better than the non-big version. But as long as customers want to buy this, companies will sell this functionality. And as long as it gets you a grant, researchers will write papers on this. Whether it works or not. That's life.
There are a few cases where these things work. Google search is an example, and CERN. But also image recognition (not using Hadoop, though; clusters of GPUs seem to be the way to go there) has recently benefited from an increase in data size. But in any of these cases, you have rather clean data. Google indexes everything; CERN discards any non-interesting data, and only analyzes interesting measurements - there are no spammers feeding their spam into CERN... and in image analysis, you train on preselected relevant images, not on say webcams or random images from the internet (and if so, you treat them as random images, not as representative data).
What is the difference between big data and Hadoop?
A: The difference between big data and the open source software program Hadoop is a distinct and fundamental one. The former is an asset, often a complex and ambiguous one, while the latter is a program that accomplishes a set of goals and objectives for dealing with that asset.
Big data is simply the large sets of data that businesses and other parties put together to serve specific goals and operations. Big data can include many different kinds of data in many different kinds of formats. For example, businesses might put a lot of work into collecting thousands of pieces of data on purchases in currency formats, on customer identifiers like name or Social Security number, or on product information in the form of model numbers, sales numbers or inventory numbers. All of this, or any other large mass of information, can be called big data. As a rule, it’s raw and unsorted until it is put through various kinds of tools and handlers.
Hadoop is one of the tools designed to handle big data. Hadoop and other software products work to store and process very large data sets by distributing the work across clusters of machines. Hadoop is an open-source program under the Apache license that is maintained by a global community of users. It includes various main components, including a MapReduce set of functions and the Hadoop Distributed File System (HDFS).
The idea behind MapReduce is that Hadoop first applies a map function to each record of a large data set, emitting intermediate key-value pairs, and then applies a reduce function that combines all the values sharing a key into a result. A reduce function can be thought of as a kind of aggregation over the mapped data. HDFS, in turn, distributes and replicates the data across the machines of the cluster.
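The map-then-reduce flow can be sketched in a few lines of ordinary Python. This is a single-process illustration of the programming model only, using the classic word-count task; the function names are my own, and it is not the Hadoop API, which distributes these phases across machines.

```python
from collections import defaultdict

def map_phase(record):
    """Emit (key, value) pairs for one input record; here, (word, 1)."""
    for word in record.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

records = ["big data is big", "data mining is not big data"]
pairs = [pair for record in records for pair in map_phase(record)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

Because each map call looks at one record and each reduce call looks at one key, both phases can run in parallel on different machines, which is what makes the model a good fit for a cluster.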
Database administrators, developers and others can use the various features of Hadoop to deal with big data in any number of ways. For example, Hadoop can be used to pursue data strategies like clustering and targeting with non-uniform data, or data that doesn't fit neatly into a traditional table or respond well to simple queries.
See the article posted at http://www.shareideaonline.com/cs/what-is-the-difference-between-big-data-and-hadoop/
Thanks
Ankush
This answer is really intended to add some specificity to the excellent answer from Anony-Mousse.
There's a lot of debate over exactly what Big Data is. Anony-Mousse called out a lot of the issues here around the overuse of terms like analytics, big data, and data mining, but there are a few things I want to provide more detail on.
Big Data
For practical purposes, the best definition I've heard of big data is data that is inconvenient or does not function in a traditional relational database. This could be data of 1PB that cannot be worked with or even just data that is 1GB but has 5,000 columns.
This is a loose and flexible definition. There are always going to be setups or data management tools which can work around it, but this is where tools like Hadoop, MongoDB, and others can be used more efficiently than prior technology.
What can we do with data that is this inconvenient/large/difficult to work with? It's difficult to simply look at a spreadsheet and to find meaning here, so we often use data mining and machine learning.
Data Mining
This was called out lightly above - my goal here is to be more specific and hopefully to provide more context. Data mining generally applies to somewhat supervised analytic or statistical methods for analysis of data. These may fit into regression, classification, clustering, or collaborative filtering. There's a lot of overlap with machine learning; however, data mining is still generally driven by a user rather than by the unsupervised or automated execution that defines machine learning fairly well.
Machine Learning
Often, machine learning and data mining are used interchangeably. Machine learning encompasses a lot of the same areas as data mining but also includes AI, computer vision, and other unsupervised tasks. The primary difference, and this is definitely a simplification, is that user input is not only unnecessary but generally unwanted. The goal is for these algorithms or systems to self-optimize and to improve, rather than an iterative cycle of development.
Big Data is a term covering a collection of frameworks and tools that can work with very large data sets, including for data mining.
Hadoop is a framework that splits very large data sets into blocks (64 MB by default in Hadoop 1.x; later versions default to 128 MB) and stores them in HDFS (the Hadoop Distributed File System). When its execution logic (MapReduce) is given code to process the data stored in HDFS, it takes input splits based on the blocks (the split size can be configured) and performs the extraction and computation via Mapper and Reducer processes. In this way you can do ETL, data mining, general data computation, etc.
To conclude: Big Data is a term for working with very large data sets, while Hadoop is a framework that does parallel processing very well with its components and services. That way you can do data mining with it too.
Big Data is the term people use to say how storage is cheap and easy these days and how data is available to be analyzed.
Data Mining is the process of trying to extract useful information from data.
Usually, Data Mining is related to Big Data for two reasons:
when you have lots of data, patterns are not so evident, so nobody can just inspect the data and say "aha". He/she needs tools for that.
often, lots of data can improve the statistical significance of your analysis because your sample is bigger.
Can we say Hadoop is doing data mining in parallel? What is Hadoop? Their site says
The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers
using simple programming models
So the "parallel" part of your statement is true. The "data mining" part of it is not necessarily. You can just use hadoop to summarize tons of data and this is not necessarily data mining, for example. But for most cases, you can bet people are trying to extract useful info from big data using hadoop, so this is kind of a yes.
I would say that BigData is a modernized framework for addressing the new business needs.
As many people might know, BigData is all about the 3 V's: Volume, Variety and Velocity. BigData is a need to leverage a variety of data (structured and unstructured), to use clustering techniques to address the volume issue, and to get results in less time, i.e. velocity.
Whereas data mining works on the ETL principle, i.e. finding useful information in large data sets using modelling techniques. There are many BI tools available on the market to achieve this.
I am currently trying to improve the performance of a web application. The goal of the application is to provide (real-time) analytics. We have a database model that is similar to a star schema: a few fact tables and many dimension tables. The database runs on MySQL with the MyISAM engine.
The Fact table size can easily go into the upper millions and some dimension tables can also reach the millions.
Now the point is, select queries can get awfully slow when the dimension tables are joined to the fact tables and aggregations are done as well. The first thing that comes to mind when hearing this is: why not precalculate the data? This is not possible because the users are allowed to use several freely customizable filters.
So what I need is an all-in-one system suitable for every purpose ;) Sadly, it hasn't been invented yet. So I came up with the idea of combining two existing systems: mixing a row-oriented and a column-oriented database (e.g. InfiniDB or Infobright). I would keep the MySQL MyISAM solution (for fast inserts and row-based queries), add a column-oriented database (for fast aggregation operations on a few columns), and fill it periodically (nightly) via a cron job. The problem would be querying the current data (it must be real time), for which I might need to get data from both databases, which can complicate things.
First tests with infinidb showed really good performance on aggregation of a few columns, so I really think this could help me speed up the application.
So the question is: is this a good idea? Has somebody maybe already done this? Maybe there are better ways to do it.
I have no experience in column oriented databases yet and I'm also not sure how the schema of it should look like. First tests showed good performance on the same star schema like structure but also in a big table like structure.
I hope this question fits on SO.
Greenplum, which is a proprietary (but mostly free-as-in-beer) extension to PostgreSQL, supports both column-oriented and row-oriented tables with highly customizable compression. Further, you can mix settings within the same table if you expect that some parts will experience heavy transactional load while others won't. E.g., you could have the most recent year be row-oriented and uncompressed, the prior year column-oriented and quicklz-compressed, and all historical years column-oriented and bz2-compressed.
Greenplum is free for use on individual servers, but if you need to scale out with its MPP features (which are its primary selling point) it does cost significant amounts of money, as they're targeting large enterprise customers.
(Disclaimer: I've dealt with Greenplum professionally, but only in the context of evaluating their software for purchase.)
As for the issue of how to set up the schema, it's hard to say much without knowing the particulars of your data, but in general having compressed column-oriented tables should make all of your intuitions about schema design go out the window.
In particular, normalization is almost never worth the effort, and you can sometimes get big gains in performance by denormalizing to borderline-comical levels of redundancy. If the data never hits disk in an uncompressed state, you might just not care that you're repeating each customer's name 40,000 times. Infobright's compression algorithms are designed specifically for this sort of application, and it's not uncommon at all to end up with 40-to-1 ratios between the logical and physical sizes of your tables.
I have a database, consisting of a whole bunch of records (around 600,000) where some of the records have certain fields missing. My goal is to find a way to predict what the missing data values should be (so I can fill them in) based on the existing data.
One option I am looking at is clustering - i.e. representing the records that are all complete as points in some space, looking for clusters of points, and then, when given a record with missing data values, trying to find out whether there are any clusters it could belong to that are consistent with the existing data values. However, this may not be possible because some of the data fields are on a nominal scale (e.g. color) and thus can't be put in order.
Another idea I had is to create some sort of probabilistic model that would predict the data, train it on the existing data, and then use it to extrapolate.
What algorithms are available for doing the above, and is there any freely available software that implements those algorithms? (This software is going to be in C#, by the way.)
This is less of an algorithmic and more of a philosophical and methodological question. There are a few different techniques available to tackle this kind of question. Acock (2005) gives a good introduction to some of the methods. Although it may seem that there is a lot of math/statistics involved (and may seem like a lot of effort), it's worth thinking what would happen if you messed up.
Andrew Gelman's blog is also a good resource, although the search functionality on his blog leaves something to be desired...
Hope this helps.
Acock (2005)
http://oregonstate.edu/~acock/growth-curves/working%20with%20missing%20values.pdf
Andrew Gelman's blog
http://www.stat.columbia.edu/~cook/movabletype/mlm/
Dealing with missing values is a methodical question that has to do with the actual meaning of the data.
Several methods you can use (detailed post on my blog):
1. Ignore the data row. This is usually done when the class label is missing (assuming your data mining goal is classification), or when many attributes are missing from the row (not just one). However, you'll obviously get poor performance if the percentage of such rows is high.
2. Use a global constant to fill in for missing values, like "unknown", "N/A" or minus infinity. This is used because sometimes it just doesn't make sense to try and predict the missing value. For example, if you have a DB of, say, college candidates and state of residence is missing for some, filling it in doesn't make much sense...
3. Use the attribute mean. For example, if the average income of a US family is X, you can use that value to replace missing income values.
4. Use the attribute mean for all samples belonging to the same class. Let's say you have a car pricing DB that, among other things, classifies cars as "Luxury" and "Low budget", and you're dealing with missing values in the cost field. Replacing the missing cost of a luxury car with the average cost of all luxury cars is probably more accurate than the value you'd get if you factored in the low budget cars.
5. Use a data mining algorithm to predict the value. The value can be determined using regression, inference-based tools using Bayesian formalism, decision trees, or clustering algorithms used to generate input for method #4 (k-means, k-medians, etc.).
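The class-conditional mean approach from the list above (with the overall attribute mean as a fallback) can be sketched as follows. It is in Python rather than C# purely for brevity, and the function name, field names, and car records are made up for the example.

```python
from collections import defaultdict
from statistics import mean

def impute_class_mean(rows, field, class_field):
    """Fill missing `field` values with the mean of rows in the same class.

    Falls back to the overall mean when a class has no known values.
    """
    known = [r for r in rows if r[field] is not None]
    overall = mean(r[field] for r in known)
    by_class = defaultdict(list)
    for r in known:
        by_class[r[class_field]].append(r[field])
    filled = []
    for r in rows:
        r = dict(r)  # copy so the input is left untouched
        if r[field] is None:
            values = by_class.get(r[class_field])
            r[field] = mean(values) if values else overall
        filled.append(r)
    return filled

# Hypothetical car-pricing records; None marks a missing cost.
cars = [
    {"segment": "Luxury", "cost": 80000},
    {"segment": "Luxury", "cost": 100000},
    {"segment": "Luxury", "cost": None},
    {"segment": "Low budget", "cost": 10000},
    {"segment": "Low budget", "cost": 14000},
    {"segment": "Low budget", "cost": None},
]
result = impute_class_mean(cars, "cost", "segment")
for car in result:
    print(car)
```

Swapping `mean` for a regression or decision-tree prediction trained on the complete rows gives the model-based variant; the surrounding bookkeeping stays the same.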
I'd suggest looking into regression and decision trees first (ID3 tree generation) as they're relatively easy and there are plenty of examples on the net.
As for packages, if you can afford it and you're in the Microsoft world look at SQL Server Analysis Services (SSAS for short) that implement most of the mentioned above.
Here are some links to free data mining software packages:
WEKA - http://www.cs.waikato.ac.nz/ml/weka/index.html
ORANGE - http://www.ailab.si/orange
TANAGRA - http://eric.univ-lyon2.fr/~ricco/tanagra/en/tanagra.html
Although not C#, here's a pretty good intro to decision trees and Bayesian learning (using Ruby):
http://www.igvita.com/2007/04/16/decision-tree-learning-in-ruby/
http://www.igvita.com/2007/05/23/bayes-classification-in-ruby/
There's also this Ruby library that I find very useful (also for learning purposes):
http://ai4r.rubyforge.org/machineLearning.html
There should be plenty of samples for these algorithms online in any language so I'm sure you'll easily find C# stuff too...
Edited:
Forgot this in my original post. This is definitely a MUST HAVE if you're playing with data mining...
Download Microsoft SQL Server 2008 Data Mining Add-ins for Microsoft Office 2007 (It requires SQL Server Analysis Services - SSAS - which isn't free but you can download a trial).
This will allow you to easily play and try out the different techniques in Excel before you go and implement this stuff yourself. Then again, since you're in the Microsoft ecosystem, you might even decide to go for an SSAS based solution and count on the SQL Server guys to do it for ya :)
Predicting missing values is generally considered to be part of data cleansing phase which needs to be done before the data is mined or analyzed further. This is quite prominent in real world data.
Please have a look at this algorithm http://arxiv.org/abs/math/0701152
Currently Microsoft SQL Server Analysis Services 2008 also comes with algorithms like these http://technet.microsoft.com/en-us/library/ms175312.aspx which help in predictive modelling of attributes.
cheers
We have a system that generates many events as the result of a phone call, web request, SMS, email, etc. Each of these events needs to be stored and made available for reporting (for MI/BI, etc.). Each event has many variables and does not fit any one specific schema.
The structure of the event document is a key-value pair list (cdr= 1&name=Paul&duration=123&postcode=l21). Currently we have a SQL Server system using dynamically generated sparse columns to store our (flat) document, of which we have reports that run against the data, for many different reasons I am looking at other solutions.
I am looking for suggestions of a system (open or closed) that allows us to push these events in (regardless of the schema) and provides reporting and analytics on top of it.
I have seen Pentaho and Jasper, but most of them seem to connect to a system to get the data out of it and then report on it. I really just want to be able to push a document in and have it available to be reported on.
As much as I love CouchDB, I am looking for a system that allows schema-less submitting of data and reporting on top of it (much like Pentaho, Jasper, SQL Reporting/Analytics Server etc)
I don't think there is any DBMS that will do what you want and allow an off-the-shelf reporting tool to be used. Low-latency analytic systems are not quick and easy to build. Low-latency on unstructured data is quite ambitious.
You are going to have to persist the data in some sort of database, though.
I think you may have to take a closer look at your problem domain. Are you trying to run low-latency analytical reports, or an operational report that prompts some action within the business when certain events occur? For low-latency systems you need to be quite ruthless about what constitutes operational reporting and what constitutes analytics.
Edit: Discourage the 'potentially both' mindset unless the business are prepared to pay. Investment banks and hedge funds spend big bucks and purchase supercomputers to do 'real-time analytics'. It's not a trivial undertaking. It's even less trivial when you try to do such a system and build it for high uptimes.
Even on apps like premium-rate SMS services and .com applications the business often backs down when you do a realistic scope and cost analysis of the problem. I can't say this enough. Be really, really ruthless about 'realtime' requirements.
If the business really, really need realtime analytics then you can make hybrid OLAP architectures where you have a marching lead partition on the fact table. This is an architecture where the fact table or cube is fully indexed for historical data but has a small leading partition that is not indexed and thus relatively quick to insert data into.
Analytic queries will table scan the relatively small leading data partition and use more efficient methods on the other partitions. This gives you low latency data and the ability to run efficient analytic queries over the historical data.
Run a process nightly that rolls over to a new leading partition and consolidates/indexes the previous lead partition.
This works well where you have items such as bitmap indexes (on databases) or materialised aggregations (on cubes) that are expensive on inserts. The lead partition is relatively small and cheap to table scan but efficient to trickle insert into. The roll-over process incrementally consolidates this lead partition into the indexed historical data which allows it to be queried efficiently for reports.
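The lead-partition idea can be modelled in a few lines. The sketch below is a toy in-memory stand-in, not a database design: a `Counter` plays the role of the indexed and aggregated historical partitions, a plain list plays the unindexed lead partition, and the class and method names are assumptions for illustration.

```python
from collections import Counter

class LeadPartitionStore:
    """Toy model: an aggregated historical store plus a small append-only lead partition."""

    def __init__(self):
        self.history = Counter()   # stands in for indexed/aggregated partitions
        self.lead = []             # stands in for the unindexed lead partition

    def insert(self, caller, duration):
        # Cheap trickle insert: append only, no index maintenance.
        self.lead.append((caller, duration))

    def total_duration(self, caller):
        # Indexed lookup on history plus a scan of the small lead partition.
        recent = sum(d for c, d in self.lead if c == caller)
        return self.history[caller] + recent

    def roll_over(self):
        # Nightly job: consolidate the lead partition into history, then start a new one.
        for caller, duration in self.lead:
            self.history[caller] += duration
        self.lead = []

store = LeadPartitionStore()
store.insert("paul", 123)
store.insert("anna", 40)
store.roll_over()
store.insert("paul", 7)
print(store.total_duration("paul"))
```

The query cost is one cheap lookup plus a scan whose size is bounded by a day's worth of inserts, which is the trade the architecture is making.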
Edit 2: The common fields might be candidates to set up as dimensions on a fact table (e.g. caller, time). The less common fields are (presumably) coding. For an efficient schema you could move the optional coding into one or more 'junk' dimensions.
Briefly, a junk dimension is one that represents every existing combination of two or more codes. A row on the table doesn't relate to a single system entity but to a unique combination of coding. Each row on the dimension table corresponds to a distinct combination that occurs in the raw data.
In order to have any analytic value you are still going to have to organise the data so that the columns in the junk dimension contain something consistently meaningful. This goes back to some requirements work to make sure that the mappings from the source data make sense. You can deal with items that are not always recorded by using a placeholder value such as a zero-length string (''), which is probably better than nulls.
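A junk dimension can be built by assigning one surrogate key to each distinct combination of codes that occurs in the raw data. The sketch below does this in memory; the function name, the `channel` and `outcome` fields, and the sample events are hypothetical, and missing codes are filled with the zero-length-string placeholder suggested above.

```python
def build_junk_dimension(events, code_fields, placeholder=""):
    """Assign one surrogate key per distinct combination of optional codes."""
    dimension = {}   # combination tuple -> surrogate key
    fact_keys = []   # junk-dimension key for each event, in input order
    for event in events:
        # Missing codes get a placeholder rather than a null.
        combo = tuple(event.get(f, placeholder) for f in code_fields)
        key = dimension.setdefault(combo, len(dimension) + 1)
        fact_keys.append(key)
    return dimension, fact_keys

# Hypothetical events with optional coding fields.
events = [
    {"channel": "sms", "outcome": "ok"},
    {"channel": "web"},
    {"channel": "sms", "outcome": "ok"},
]
dimension, fact_keys = build_junk_dimension(events, ["channel", "outcome"])
print(dimension, fact_keys)
```

The fact table then carries a single integer per event instead of one sparse column per optional code.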
Now I think I see the underlying requirements. This is an online or phone survey application with custom surveys. The way to deal with this requirement is to fob the analytics off onto the client. No online tool will let you turn around schema changes in 20 minutes.
I've seen this type of requirement before and it boils down to the client wanting to do some stats on a particular survey. If you can give them a CSV based on the fields (i.e. with named header columns) in their particular survey they can import it into excel and pivot it from there.
This should be fairly easy to implement from a configurable online survey system as you should be able to read the survey configuration. The client will be happy that they can play with their numbers in Excel as they don't have to get their head around a third party tool. Any competent salescritter should be able to spin this to the client as a good thing. You can use a spiel along the lines of 'And you can use familiar tools like Excel to analyse your numbers'. (or SAS if they're that way inclined)
Wrap the exporter in a web page so they can download it themselves and get up-to-date data.
Note that the wheels will come off if you have larger data volumes over 65535 respondents per survey as this won't fit onto a spreadsheet tab. Excel 2007 increases this limit to 1048575. However, surveys with this volume of response will probably be in the minority. One possible workaround is to provide a means to get random samples of the data that are small enough to work with in Excel.
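A minimal sketch of such an exporter, assuming the events arrive as key-value strings in the format shown in the question: it writes a CSV with named header columns and, when a row cap is given and exceeded, exports a random sample instead. The function name and the `max_rows` parameter are my own inventions for the example.

```python
import csv
import io
import random
from urllib.parse import parse_qsl

def export_csv(raw_events, fields, max_rows=None, rng=None):
    """Write key-value events to CSV text with named header columns.

    If max_rows is set and exceeded, a random sample of that size is exported.
    """
    rows = [{k: v.strip() for k, v in parse_qsl(e)} for e in raw_events]
    if max_rows is not None and len(rows) > max_rows:
        rows = (rng or random.Random()).sample(rows, max_rows)
    out = io.StringIO()
    # Fields missing from an event become empty cells; unknown fields are dropped.
    writer = csv.DictWriter(out, fieldnames=fields, restval="", extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

raw = ["cdr= 1&name=Paul&duration=123&postcode=l21", "cdr=2&name=Anna&duration=40"]
fields = ["cdr", "name", "duration", "postcode"]
text = export_csv(raw, fields)
print(text)
```

In a real system the `fields` list would come from the survey configuration, so each client gets exactly the columns of their own survey.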
Edit: I don't think there are other solutions that are sufficiently flexible for this type of application. You've described a holy grail of survey statistics.
I still think that the basic strategy is to give them a data dump. You can pre-package it to some extent by using OLE automation to construct a pivot table and deliver something partially digested. The API for pivot tables in Excel is a bit hairy, but I have written VBA code that programmatically creates pivot tables in the past, so I can say from personal experience that this is quite feasible.
The problem becomes a bit more complex if you want to compute and report distributions of (say) response times, as you have to construct the displays. You can programmatically construct pivot charts if necessary, but automating report construction through Excel in this way will be a fair bit of work.
You might get some mileage from R (www.r-project.org) as you can construct a framework that lets you import data and generate bespoke reports with a bit of R Code. This is not an end-user tool but your client base sounds like they want canned reports anyway.