Where does map-reduce/hadoop come in in machine learning training? - hadoop

Map-reduce/hadoop is perfect in gathering insights from piles of data from various resources, and organize them in a way we want it to be.
But when it comes to training, my impression is that we have to dump all the training data into algorithm (be it SVN, Logistic regression, or random forest) all at once so that the algorithm is able to come up with a model that has it all. Can map-reduce/hadoop help in the training part? If yes, how in general?

Yes. There are many MapReduce implementations such as hadoop streaming and even some easy tools like Pig, which can be used for learning. In addition, there are distributed learning toolset built upon Map/Reduce such as vowpal wabbit (https://github.com/JohnLangford/vowpal_wabbit/wiki/Tutorial). The big idea of this kind of methods is to do training on small portion of data (split by HDFS) and then averaging the models and commutation with each nodes. So the model get updates directly from submodels built on part of the data.

Related

number of layers in convolution neural network

I am a beginner in convolution networks. I use digits to implement them and facing with few doubts.
While trying out a basic classification problem of images, how do we decide on the number of layers - how many conv layers/ fully connected layer, etc.
In digits we have 3 standard papers implemented, for a particular dataset is there any way to find out which architecture to use – or when should we use our own architecture.
How can the hidden layers be helpful in solving the problems – i.e. what possible decisions can we take by looking at the results in the hidden layer
Deciding on how many layers or neurons is needed or the best architecture for building neural network was never clear or possible. the main procedure was taken before is to try building on some parameters and then measure the performance on training set and testing set not bias or to over fit the data and decide on the best parameters, or try some other algorithm like genetic algorithm.
conclusion either you start from scratch every time to measure the network performance or apply other algorithms which doesn't need to start from scratch and can build incrementally by applying transfer learning and fine tuning on the network architecture.
The core philosophy that makes deep learning so democratic and amazing is simple "Don't be a Hero".
What it means is that in most cases the best deep learning models take millions of data points and weeks to train, something most of us cannot achieve with our low performance PC's (yes a single GPU system is low performance). So why would you want to waste your time in building and training NN architectures. Simple you don't.
Transfer learning is your solution!! try to find models that are trained on data similar to your problem and use their pre-trained weights to fine tune your data set. Doing this not only do you get an already proven NN architecture but also a major head start in training.
The best place to find pre-trained models is the caffe model zoo so go have a look at it.

What is the difference between Big Data and Data Mining? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 6 years ago.
Improve this question
As Wikpedia states
The overall goal of the data mining process is to extract information
from a data set and transform it into an understandable structure for
further use
How is this related with Big Data? Is it correct if I say that Hadoop is doing data mining in a parallel manner?
Big data is everything
Big data is a marketing term, not a technical term. Everything is big data these days. My USB stick is a "personal cloud" now, and my harddrive is big data. Seriously. This is a totally unspecific term that is largely defined by what the marketing departments of various very optimistic companies can sell - and the C*Os of major companies buy, in order to make magic happen. Update: and by now, the same applies to data science. It's just marketing.
Data mining is the old big data
Actually, data mining was just as overused... it could mean anything such as
collecting data (think NSA)
storing data
machine learning / AI (which predates the term data mining)
non-ML data mining (as in "knowledge discovery", where the term data mining was actually coined; but where the focus is on new knowledge, not on learning of existing knowledge)
business rules and analytics
visualization
anything involving data you want to sell for truckloads of money
It's just that marketing needed a new term. "Business intelligence", "business analytics", ... they still keep on selling the same stuff, it's just rebranded as "big data" now.
Most "big" data mining isn't big
Since most methods - at least those that give interesting results - just don't scale, most data "mined" isn't actually big. It's clearly much bigger than 10 years ago, but not big as in Exabytes. A survey by KDnuggets had something like 1-10 GB being the average "largest data set analyzed". That is not big data by any data management means; it's only large by what can be analyzed using complex methods. (I'm not talking about trivial algorithms such a k-means).
Most "big data" isn't data mining
Now "Big data" is real. Google has Big data, and CERN also has big data. Most others probably don't. Data starts being big, when you need 1000 computers just to store it.
Big data technologies such as Hadoop are also real. They aren't always used sensibly (don't bother to run hadoop clusters less than 100 nodes - as this point you probably can get much better performance from well-chosen non-clustered machines), but of course people write such software.
But most of what is being done isn't data mining. It's Extract, Transform, Load (ETL), so it is replacing data warehousing. Instead of using a database with structure, indexes and accelerated queries, the data is just dumped into hadoop, and when you have figured out what to do, you re-read all your data and extract the information you really need, tranform it, and load it into your excel spreadsheet. Because after selection, extraction and transformation, usually it's not "big" anymore.
Data quality suffers with size
Many of the marketing promises of big data will not hold. Twitter produces much less insights for most companies than advertised (unless you are a teenie rockstar, that is); and the Twitter user base is heavily biased. Correcting for such a bias is hard, and needs highly experienced statisticians.
Bias from data is one problem - if you just collect some random data from the internet or an appliction, it will usually be not representative; in particular not of potential users. Instead, you will be overfittig to the existing heavy-users if you don't manage to cancel out these effects.
The other big problem is just noise. You have spam bots, but also other tools (think Twitter "trending topics" that cause reinforcement of "trends") that make the data much noiser than other sources. Cleaning this data is hard, and not a matter of technology but of statistical domain expertise. For example Google Flu Trends was repeatedly found to be rather inaccurate. It worked in some of the earlier years (maybe because of overfitting?) but is not anymore of good quality.
Unfortunately, a lot of big data users pay too little attention to this; which is probably one of the many reasons why most big data projects seem to fail (the others being incompetent management, inflated and unrealistic expectations, and lack of company culture and skilled people).
Hadoop != data mining
Now for the second part of your question. Hadoop doesn't do data mining. Hadoop manages data storage (via HDFS, a very primitive kind of distributed database) and it schedules computation tasks, allowing you to run the computation on the same machines that store the data. It does not do any complex analysis.
There are some tools that try to bring data mining to Hadoop. In particular, Apache Mahout can be called the official Apache attempt to do data mining on Hadoop. Except that it is mostly a machine learning tool (machine learning != data mining; data mining sometimes uses methods from machine learning). Some parts of Mahout (such as clustering) are far from advanced. The problem is that Hadoop is good for linear problems, but most data mining isn't linear. And non-linear algorithms don't just scale up to large data; you need to carefully develop linear-time approximations and live with losses in accuracy - losses that must be smaller than what you would lose by simply working on smaller data.
A good example of this trade-off problem is k-means. K-means actually is a (mostly) linear problem; so it can be somewhat run on Hadoop. A single iteration is linear, and if you had a good implementation, it would scale well to big data. However, the number of iterations until convergence also grows with data set size, and thus it isn't really linear. However, as this is a statistical method to find "means", the results actually do not improve much with data set size. So while you can run k-means on big data, it does not make a whole lot of sense - you could just take a sample of your data, run a highly-efficient single-node version of k-means, and the results will be just as good. Because the extra data just gives you some extra digits of precision of a value that you do not need to be that precise.
Since this applies to quite a lot of problems, actual data mining on Hadoop doesn't seem to kick off. Everybody tries to do it, and a lot of companies sell this stuff. But it doesn't really work much better than the non-big version. But as long as customers want to buy this, companies will sell this functionality. And as long as it gets you a grant, researchers will write papers on this. Whether it works or not. That's life.
There are a few cases where these things work. Google search is an example, and Cern. But also image recognition (but not using Hadoop, clusters of GPUs seem to be the way to go there) has recently benefited from an increase in data size. But in any of these cases, you have rather clean data. Google indexes everything; Cern discards any non-interesting data, and only analyzes interesting measurements - there are no spammers feeding their spam into Cern... and in image analysis, you train on preselected relevant images, not on say webcams or random images from the internet (and if so, you treat them as random images, not as representative data).
What is the difference between big data and Hadoop?
A: The difference between big data and the open source software program Hadoop is a distinct and fundamental one. The former is an asset, often a complex and ambiguous one, while the latter is a program that accomplishes a set of goals and objectives for dealing with that asset.
Big data is simply the large sets of data that businesses and other parties put together to serve specific goals and operations. Big data can include many different kinds of data in many different kinds of formats. For example, businesses might put a lot of work into collecting thousands of pieces of data on purchases in currency formats, on customer identifiers like name or Social Security number, or on product information in the form of model numbers, sales numbers or inventory numbers. All of this, or any other large mass of information, can be called big data. As a rule, it’s raw and unsorted until it is put through various kinds of tools and handlers.
Hadoop is one of the tools designed to handle big data. Hadoop and other software products work to interpret or parse the results of big data searches through specific proprietary algorithms and methods. Hadoop is an open-source program under the Apache license that is maintained by a global community of users. It includes various main components, including a MapReduce set of functions and a Hadoop distributed file system (HDFS).
The idea behind MapReduce is that Hadoop can first map a large data set, and then perform a reduction on that content for specific results. A reduce function can be thought of as a kind of filter for raw data. The HDFS system then acts to distribute data across a network or migrate it as necessary.
Database administrators, developers and others can use the various features of Hadoop to deal with big data in any number of ways. For example, Hadoop can be used to pursue data strategies like clustering and targeting with non-uniform data, or data that doesn't fit neatly into a traditional table or respond well to simple queries.
See the article posted at http://www.shareideaonline.com/cs/what-is-the-difference-between-big-data-and-hadoop/
Thanks
Ankush
This answer is really intended to add some specificity to the excellent answer from Anony-Mousse.
There's a lot of debate over exactly what Big Data is. Anony-Mousse called out a lot of the issues here around the overuse of terms like analytics, big data, and data mining, but there are a few things I want to provide more detail on.
Big Data
For practical purposes, the best definition I've heard of big data is data that is inconvenient or does not function in a traditional relational database. This could be data of 1PB that cannot be worked with or even just data that is 1GB but has 5,000 columns.
This is a loose and flexible definition. There are always going to be setups or data management tools which can work around it, but, this is where tools like Hadoop, MongoDB, and others can be used more efficiently that prior technology.
What can we do with data that is this inconvenient/large/difficult to work with? It's difficult to simply look at a spreadsheet and to find meaning here, so we often use data mining and machine learning.
Data Mining
This was called out lightly above - my goal here is to be more specific and hopefully to provide more context. Data mining generally applies to somewhat supervised analytic or statistical methods for analysis of data. These may fit into regression, classification, clustering, or collaborative filtering. There's a lot of overlap with machine learning, however, this is still generally driven by a user rather that unsupervised or automated execution, which defines machine learning fairly well.
Machine Learning
Often, machine learning and data mining are used interchangeably. Machine learning encompasses a lot of the same areas as data mining but also includes AI, computer vision, and other unsupervised tasks. The primary difference, and this is definitely a simplification, is that user input is not only unnecessary but generally unwanted. The goal is for these algorithms or systems to self-optimize and to improve, rather than an iterative cycle of development.
Big Data is a TERM which consists of collection of frameworks and tools which could do miracles with the very large data sets including Data Mining.
Hadoop is a framework which will split the very large data sets into blocks(by default 64 mb) then it will store it in HDFS (Hadoop Distributed File System) and then when its execution logic(MapReduce) comes with any bytecode to process the data stored at HDFS. It will take the split based on block(splits can be configured) and impose the extraction and computation via Mapper and Reducer process. By this way you could do ETL process, Data Mining, Data Computation, etc.,
I would like to conclude that Big Data is a terminology which could play with very large data sets. Hadoop is a framework which can do parallel processing very well with its components and services. By that way you can acquire Data mining too..
Big Data is the term people use to say how storage is cheap and easy these days and how data is available to be analyzed.
Data Mining is the process of trying to extract useful information from data.
Usually, Data Mining is related to Big Data for 2 reasons
when you have lots of data, patterns are not so evident, so someone could not just inspect and say "hah". He/she needs tools for that.
for many times lots of data can improve the statistical meaningful to your analysis because your sample is bigger.
Can we say hadoop is dois data mining in parallel? What is hadoop? Their site says
The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers
using simple programming models
So the "parallel" part of your statement is true. The "data mining" part of it is not necessarily. You can just use hadoop to summarize tons of data and this is not necessarily data mining, for example. But for most cases, you can bet people are trying to extract useful info from big data using hadoop, so this is kind of a yes.
I would say that BigData is a modernized framework for addressing the new business needs.
As many people might know BigData is all about 3 v's Volume,Variety and Velocity. BigData is a need to leverage a variety of data (structured and un structured data) and using clustering technique to address volume issue and also getting results in less time ie.velocity.
Where as Datamining is on ETL principle .i.e finding useful information from large datasets using modelling techinques. There are many BI tools available in market to achieve this.

cluster analysis Hadoop, Map reduce environment

we are currently trying to create some very basic personas based on our user data base (few million profiles). The goal is to find out at this stage what the characteristics of our users are, for example what they look like and what they are looking for and to create several "typical" user profiles.
I believe the best way to achieve this would be to run a cluster analysis in order to find similarities among users.
The big roadblock however is how to get there. We are tracking our data in a Hadoop environment and I am being told that this could be potentially achieved with our tools.
I have familiarised myself with the theory of the topic and know that it can be done for example in SPSS (quite hard to use and limited to samples of large data sets).
The big question: Is it possible to perform a or different types of cluster analysis in a Hadoop environment and then visualise the results like in SPSS? It is my understanding that we would need to run several types of analysis in order to find the best way to cluster the data, also when it comes to distance measurements of the clusters.
I have not found any information on the internet with regards to this, so I wonder if this is possible at all, without a major programming effort (meaning literally implementing for example all the standard tools available in SPSS: Dendrograms, the different result tables and cluster graphs etc.).
Any input would be much appreaciated. Thanks.

MapReduce project with data mining

I am planning to do a MapReduce project involving Hadoop libraries and testing it on big data uploaded at AWS. I have not finalized an idea yet. But I am sure it will involve some kind of data processing, MapReduce design patterns and possibly Graph algorithms, Hive and PigLatin. I would really appreciate if someone can give me some ideas about it. I have few of mine in mind.
In the end I have to work on some large data set and get some information and derive some conclusions. For this I have used Weka before for data mining, (using Trees).
But I am not sure if that is the only thing I can work with right now (using Weka). Is there any other ways by which I can work on large data and derive conclusions on the large data set?
Also how can I involve graphs in this ?
Basically I want to make a research project but I am not sure what exactly I should be working on and what it should be like ? Any thoughts ? suggestive links/ideas ? Knowledge sharing ?
I will suggest you check Apache Mahout, it a scalable machine learning and data mining framework that should integrate nicely with Hadoop.
Hive gives you SQL-like language to query big data, essentially it translates your high-level query into MapReduce jobs and run it on the data cluster.
Another suggestion is to consider doing your data processing algorithm using R, it is a statistical software (similar to matlab), and I would recommend instead of the standard R environment is to use R Revolution, which is an environment to develop R, but with much powerful tools for big data and clustering.
Edit: If you are a student, R Revolution has a free academic edition.
Edit: A third suggestion, is to look at GridGain which is another Map/Reduce implementation in Java that is relatively easy to run on a cluster.
As you are already working with MapRedude and Hadoop, you can extract some knowledge from your data using Mahout or you can get some ideas from this very good book:
http://infolab.stanford.edu/~ullman/mmds.html
This books provide ideas to mine Social-Network Graphs, and works with graphs in a couple of other ways too.
Hope it helps!

Exporting Mahout model output as Weka input

I would like to use the output model of a Mahout decision tree training process as the input model for a Weka based classifier.
As the training of a complex decision tree that is based on millions of training records is almost impractical for a single node Weka classifier, I would like to use Mahout to build the model, using for example Random Forest Partial Implementation.
While the algorithm above can be problematic while training, it is rather simple to use it for prediction with Weka on a single machine.
On Mahout wiki site it is stated that the data formats for import include Weka ARFF format, but not for export.
Is it possible to use some of the existing implementations in Mahout to train models that will be used in production with a simple Weka based system?
I don't think it's possible to do what you're asking: .arff is a data format, as are all of the other options in the import/export menus. The classifiers that Weka can save/load are, in fact, Weka's java Classifier objects written to a file using Java's Serializable interface. They're not so much portable trees as they are Java objects that last longer than the JVMs which create them. Thus, to do what you want, either Mahout or Weka would have to be able to produce/read each other's code, and that's not something I can find any documentation of.
My experience is that with several million training records (consisting of ~45 numeric features/columns each), Weka's Random Forest implementation using the default options is very fast (operating in seconds on a single 2.26GHz core), so it may not be necessary to bother with Mahout. Your data set may well have different results, though.

Resources