What is the difference between Big Data and Data Mining? [closed] - hadoop

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 6 years ago.
Improve this question
As Wikpedia states
The overall goal of the data mining process is to extract information
from a data set and transform it into an understandable structure for
further use
How is this related with Big Data? Is it correct if I say that Hadoop is doing data mining in a parallel manner?

Big data is everything
Big data is a marketing term, not a technical term. Everything is big data these days. My USB stick is a "personal cloud" now, and my harddrive is big data. Seriously. This is a totally unspecific term that is largely defined by what the marketing departments of various very optimistic companies can sell - and the C*Os of major companies buy, in order to make magic happen. Update: and by now, the same applies to data science. It's just marketing.
Data mining is the old big data
Actually, data mining was just as overused... it could mean anything such as
collecting data (think NSA)
storing data
machine learning / AI (which predates the term data mining)
non-ML data mining (as in "knowledge discovery", where the term data mining was actually coined; but where the focus is on new knowledge, not on learning of existing knowledge)
business rules and analytics
visualization
anything involving data you want to sell for truckloads of money
It's just that marketing needed a new term. "Business intelligence", "business analytics", ... they still keep on selling the same stuff, it's just rebranded as "big data" now.
Most "big" data mining isn't big
Since most methods - at least those that give interesting results - just don't scale, most data "mined" isn't actually big. It's clearly much bigger than 10 years ago, but not big as in Exabytes. A survey by KDnuggets had something like 1-10 GB being the average "largest data set analyzed". That is not big data by any data management means; it's only large by what can be analyzed using complex methods. (I'm not talking about trivial algorithms such a k-means).
Most "big data" isn't data mining
Now "Big data" is real. Google has Big data, and CERN also has big data. Most others probably don't. Data starts being big, when you need 1000 computers just to store it.
Big data technologies such as Hadoop are also real. They aren't always used sensibly (don't bother to run hadoop clusters less than 100 nodes - as this point you probably can get much better performance from well-chosen non-clustered machines), but of course people write such software.
But most of what is being done isn't data mining. It's Extract, Transform, Load (ETL), so it is replacing data warehousing. Instead of using a database with structure, indexes and accelerated queries, the data is just dumped into hadoop, and when you have figured out what to do, you re-read all your data and extract the information you really need, tranform it, and load it into your excel spreadsheet. Because after selection, extraction and transformation, usually it's not "big" anymore.
Data quality suffers with size
Many of the marketing promises of big data will not hold. Twitter produces much less insights for most companies than advertised (unless you are a teenie rockstar, that is); and the Twitter user base is heavily biased. Correcting for such a bias is hard, and needs highly experienced statisticians.
Bias from data is one problem - if you just collect some random data from the internet or an appliction, it will usually be not representative; in particular not of potential users. Instead, you will be overfittig to the existing heavy-users if you don't manage to cancel out these effects.
The other big problem is just noise. You have spam bots, but also other tools (think Twitter "trending topics" that cause reinforcement of "trends") that make the data much noiser than other sources. Cleaning this data is hard, and not a matter of technology but of statistical domain expertise. For example Google Flu Trends was repeatedly found to be rather inaccurate. It worked in some of the earlier years (maybe because of overfitting?) but is not anymore of good quality.
Unfortunately, a lot of big data users pay too little attention to this; which is probably one of the many reasons why most big data projects seem to fail (the others being incompetent management, inflated and unrealistic expectations, and lack of company culture and skilled people).
Hadoop != data mining
Now for the second part of your question. Hadoop doesn't do data mining. Hadoop manages data storage (via HDFS, a very primitive kind of distributed database) and it schedules computation tasks, allowing you to run the computation on the same machines that store the data. It does not do any complex analysis.
There are some tools that try to bring data mining to Hadoop. In particular, Apache Mahout can be called the official Apache attempt to do data mining on Hadoop. Except that it is mostly a machine learning tool (machine learning != data mining; data mining sometimes uses methods from machine learning). Some parts of Mahout (such as clustering) are far from advanced. The problem is that Hadoop is good for linear problems, but most data mining isn't linear. And non-linear algorithms don't just scale up to large data; you need to carefully develop linear-time approximations and live with losses in accuracy - losses that must be smaller than what you would lose by simply working on smaller data.
A good example of this trade-off problem is k-means. K-means actually is a (mostly) linear problem; so it can be somewhat run on Hadoop. A single iteration is linear, and if you had a good implementation, it would scale well to big data. However, the number of iterations until convergence also grows with data set size, and thus it isn't really linear. However, as this is a statistical method to find "means", the results actually do not improve much with data set size. So while you can run k-means on big data, it does not make a whole lot of sense - you could just take a sample of your data, run a highly-efficient single-node version of k-means, and the results will be just as good. Because the extra data just gives you some extra digits of precision of a value that you do not need to be that precise.
Since this applies to quite a lot of problems, actual data mining on Hadoop doesn't seem to kick off. Everybody tries to do it, and a lot of companies sell this stuff. But it doesn't really work much better than the non-big version. But as long as customers want to buy this, companies will sell this functionality. And as long as it gets you a grant, researchers will write papers on this. Whether it works or not. That's life.
There are a few cases where these things work. Google search is an example, and Cern. But also image recognition (but not using Hadoop, clusters of GPUs seem to be the way to go there) has recently benefited from an increase in data size. But in any of these cases, you have rather clean data. Google indexes everything; Cern discards any non-interesting data, and only analyzes interesting measurements - there are no spammers feeding their spam into Cern... and in image analysis, you train on preselected relevant images, not on say webcams or random images from the internet (and if so, you treat them as random images, not as representative data).

What is the difference between big data and Hadoop?
A: The difference between big data and the open source software program Hadoop is a distinct and fundamental one. The former is an asset, often a complex and ambiguous one, while the latter is a program that accomplishes a set of goals and objectives for dealing with that asset.
Big data is simply the large sets of data that businesses and other parties put together to serve specific goals and operations. Big data can include many different kinds of data in many different kinds of formats. For example, businesses might put a lot of work into collecting thousands of pieces of data on purchases in currency formats, on customer identifiers like name or Social Security number, or on product information in the form of model numbers, sales numbers or inventory numbers. All of this, or any other large mass of information, can be called big data. As a rule, it’s raw and unsorted until it is put through various kinds of tools and handlers.
Hadoop is one of the tools designed to handle big data. Hadoop and other software products work to interpret or parse the results of big data searches through specific proprietary algorithms and methods. Hadoop is an open-source program under the Apache license that is maintained by a global community of users. It includes various main components, including a MapReduce set of functions and a Hadoop distributed file system (HDFS).
The idea behind MapReduce is that Hadoop can first map a large data set, and then perform a reduction on that content for specific results. A reduce function can be thought of as a kind of filter for raw data. The HDFS system then acts to distribute data across a network or migrate it as necessary.
Database administrators, developers and others can use the various features of Hadoop to deal with big data in any number of ways. For example, Hadoop can be used to pursue data strategies like clustering and targeting with non-uniform data, or data that doesn't fit neatly into a traditional table or respond well to simple queries.
See the article posted at http://www.shareideaonline.com/cs/what-is-the-difference-between-big-data-and-hadoop/
Thanks
Ankush

This answer is really intended to add some specificity to the excellent answer from Anony-Mousse.
There's a lot of debate over exactly what Big Data is. Anony-Mousse called out a lot of the issues here around the overuse of terms like analytics, big data, and data mining, but there are a few things I want to provide more detail on.
Big Data
For practical purposes, the best definition I've heard of big data is data that is inconvenient or does not function in a traditional relational database. This could be data of 1PB that cannot be worked with or even just data that is 1GB but has 5,000 columns.
This is a loose and flexible definition. There are always going to be setups or data management tools which can work around it, but, this is where tools like Hadoop, MongoDB, and others can be used more efficiently that prior technology.
What can we do with data that is this inconvenient/large/difficult to work with? It's difficult to simply look at a spreadsheet and to find meaning here, so we often use data mining and machine learning.
Data Mining
This was called out lightly above - my goal here is to be more specific and hopefully to provide more context. Data mining generally applies to somewhat supervised analytic or statistical methods for analysis of data. These may fit into regression, classification, clustering, or collaborative filtering. There's a lot of overlap with machine learning, however, this is still generally driven by a user rather that unsupervised or automated execution, which defines machine learning fairly well.
Machine Learning
Often, machine learning and data mining are used interchangeably. Machine learning encompasses a lot of the same areas as data mining but also includes AI, computer vision, and other unsupervised tasks. The primary difference, and this is definitely a simplification, is that user input is not only unnecessary but generally unwanted. The goal is for these algorithms or systems to self-optimize and to improve, rather than an iterative cycle of development.

Big Data is a TERM which consists of collection of frameworks and tools which could do miracles with the very large data sets including Data Mining.
Hadoop is a framework which will split the very large data sets into blocks(by default 64 mb) then it will store it in HDFS (Hadoop Distributed File System) and then when its execution logic(MapReduce) comes with any bytecode to process the data stored at HDFS. It will take the split based on block(splits can be configured) and impose the extraction and computation via Mapper and Reducer process. By this way you could do ETL process, Data Mining, Data Computation, etc.,
I would like to conclude that Big Data is a terminology which could play with very large data sets. Hadoop is a framework which can do parallel processing very well with its components and services. By that way you can acquire Data mining too..

Big Data is the term people use to say how storage is cheap and easy these days and how data is available to be analyzed.
Data Mining is the process of trying to extract useful information from data.
Usually, Data Mining is related to Big Data for 2 reasons
when you have lots of data, patterns are not so evident, so someone could not just inspect and say "hah". He/she needs tools for that.
for many times lots of data can improve the statistical meaningful to your analysis because your sample is bigger.
Can we say hadoop is dois data mining in parallel? What is hadoop? Their site says
The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers
using simple programming models
So the "parallel" part of your statement is true. The "data mining" part of it is not necessarily. You can just use hadoop to summarize tons of data and this is not necessarily data mining, for example. But for most cases, you can bet people are trying to extract useful info from big data using hadoop, so this is kind of a yes.

I would say that BigData is a modernized framework for addressing the new business needs.
As many people might know BigData is all about 3 v's Volume,Variety and Velocity. BigData is a need to leverage a variety of data (structured and un structured data) and using clustering technique to address volume issue and also getting results in less time ie.velocity.
Where as Datamining is on ETL principle .i.e finding useful information from large datasets using modelling techinques. There are many BI tools available in market to achieve this.

Related

Big Data Analysis Simulation

First post ever, so here we go! (Thanks for taking the time to read!)
I am currently studying in college and working on a research project on how different hardware (specifically a ram-disk vs hard rive) can affect the speed of big data analysis. I know how to set up the various hardware and all of that jazz, however, I have no previous experience with big data analysis, and after looking for a few days I have found no answers (even here). I need any software to be able to simulate big data analysis - I have read of Hadoop, but have no idea where to begin on that - and it seems that even with it there is no simulation. How would I go about getting software along with data to analyze? Specifically, something I could run as a control group and then again with the data stored on a ram-disk in order to see if there is a performance increase.
I really feel in over my head here and don't know where to start, so any help or tips are welcome. Thank you very much!
To clarify, I am hoping to begin on a very small-scale database, but I also have resources with my school to set up a very large drive to be able to test with.
There are many DB solutions out there in the market.
However, the big data DB must be designed to process this particular data. The characteristics of big data are summarized as 3V which means data volume, velocity, and variety.
Big data is a large amount of data in terabytes(TB) or more. This is the most basic feature of big data, which means that there is a large amount of data that is still being generated through multiple paths.
Also, large amounts of data must be collected and analyzed in real time in accordance with the user’s needs. The diversity of big data has various forms. That is, it includes all types of data such as a regular, semi-regular and irregular data. In addition to traditional instructed data such as books, magazines, medical records, video and audio, it also includes the data which have location information.
Machbase database is one of big data software you can try. This DB website also offers the user manual and the page of getting started, where users can easily follow instructions. Good luck!!

Does a machine learning algorithm copy the data it learns from?

I am not a programmer, rather a law student, but I am currently researching for a project involving artificial intelligence and copyright law. I am currently looking at whether the learning process of a machine learning algorithm may be copyright infringement if a protected work is used by the algorithm. However, this relies on whether or not the algorithm copies the work or does something else.
Can anyone tell me whether machine learning algorithms typically copy the data (picture/text/video/etc.) they are analysing (even if only briefly) or if they are able to obtain the required information from the data through other methods that do not require copying (akin to a human looking at a stop sign and recognising it as a stop sign without necessarily copying the image).
Apologies for my lack of knowledge and I'm sorry if any of my explanation flies in the face of any established machine learning knowledge. As I said, I am merely a lowly law student.
Thanks in advance!
A few machine learning algorithms actually retain a copy of the training set, for example k-nearest neighbours. See https://en.wikipedia.org/wiki/Instance-based_learning. Not all do this; in fact it is usually regarded as a disadvantage, because the training set can be large.
Also, computers are also built round a number of different stores of data of different sizes and speeds. They usually copy data they are working on to small fast stores while they are working on it, because the larger stores take much longer to read and write. One of many possible examples of this has been the subject of legal wrangling of which I know little - see e.g. https://law.stackexchange.com/questions/2223/why-does-browser-cache-not-count-as-copyright-infringement and others for browser cache copyright. If a computer has added two numbers, it will certainly have stored them in its internal memory. It is very likely that it will have stored at least one of them in what are called internal registers - very small very fast memory intended for storing numbers to be worked on.
If a computer (or any other piece of electronic equipment) has been used to process classified data, it is usual to treat it as classified from then on, making the worst case assumption that it might have retained some copy of any of the data it has been used to process, even if retrieving that data from it would in practice require a great deal of specialised expertise with specialised equipment.
Typically, no. The first thing that typical ML algorithms do with their inputs is not to copy or store it, but to compute something based on it and then forget the original. And this is a fair description of what neural networks, regression algorithms and statistical methods do. There is no 'eidetic memory' in mainstream ML. I imagine anything doing that would be marketed as a database or a full text indexing engine or somesuch.
But how will you present your data to an algorithm running on a machine without first copying the data to that machine?
Does a machine learning algorithm copy the data it learns from?
There are many different machine learning algorithms. If you are talking about k nearest neighbor (k-NN) then the answer is simply yes.
However, k-NN is rarely used. Most (all?) other models are not that simple. Usually, a machine learning developer wants the training data to be compressed (a lot, lossy) by the model for several reasons: (1) The amount of training data is large (many GB), (2) Generalization might be better if the training data is compressed (3) inference of new examples might take really long if the data is not compressed. (By "compress", I mean that the relevant information for the task is extracted and irrelevant data is removed. Not compression in the usual sense.)
For other models than k-NN, the answer is more complicated. It depends on what you consider a "copy". For example, from artificial neural networks (especially the sub-type of convolutional neural networks, short: CNNs) the training data can partially be restored. Those models ware state of the art for many (all?) computer vision tasks.
I could not find papers which show that you can (partially) restore / extract training data from CNNs with the focus on possible privacy / copyright problems, but I'm ~70% certain I have read an abstract about this problem. I think I've also heard a talk where a researcher said this was a problem when building a detector for child pornography. However, I don't think that was recorded or anything published about this.
Here are two papers which indicate that restoring training data from CNNs might be possible:
Understanding deep learning requires rethinking generalization
Visualizing Deep Convolutional Neural Networks Using Natural Pre-Images and the Zeiler & Fergus paper
It depends on what you mean by the word "copy". If you run any program, it will copy the data from the hard disk to RAM for processing. I am assuming this is not what you meant.
So let's say you have the copyrighted data in a particular machine and you run your machine learning algorithms on the data, then there is no reason for the algorithm to copy the data out of the machine.
On the other hand, if you use a cloud ML service(AWS/IBM Bluemix/Azure), then you need to upload the data to the cloud before you can run ML algorithms. This would mean you are copying the data.
Hopefully this sheds more light !
Lowly ML student
Some of the machines do copy the data set such as KNN. Unfortunately, such algorithms are not commonly used in practice because they can't be scaled for large data set.
Most ML algorithms use the data set to identify a pattern, that's why pattern recognition is another name for machine learning. The pattern is almost always much smaller (in terms of memory and variables etc) than the original data set.

How to apply data prediction algorithms on networking data?

I want to use data prediction algorithms on Network data.so can anyone point me on the right direction please.
which algorithm is most effective and how to apply data on those formula's.
You will have to be more specific if you want a more detailed answer. Until then, here's something to help you get started.
When you say network data, the first thing I think about is streams of data packets, streams of transactions, and such. These usually arrive at great speed (e.g. it could be more than thousands per minute), are potentially infinite (all the elements won't fit into into a working memory; the best you can hope for is to store a very small fraction of elements), and usually you want your prediction model to adapt to recent examples while using "small" amount of memory.
It is tempting to say that we could learn a model on a sample of data. However, the model you learn today could no longer be valid the next week---you want to be able to detect these changes and adapt your model continuously.
There is a whole branch of data mining dedicated to scenarios of this type; it is called data stream mining.
Depending on what you want to predict, I suggest that you consult Data Stream Mining: A Practical Approach by the team that develops MOA (which stands for Massive Online Analysis and is, as far as I know, the leading data stream mining toolbox); I also recommend the chapter on data stream mining from the Mining of Massive Data Sets book by Leskovec et al. The Wiki article on data stream mining is a good palce to look for further references.

Exasol vs HBase

I'm quite new to BigData architecture so please don't be to harsh on me.
I am trying to figure out the best alternative to build a BI Architecture able to deal with huge amounts of data. As I see it, the solution has to be clustered/horizontally scalable to cope with system growing. I would like to be able to interact with the system using SQL, so HBase + Hive (or even Pig, not for sql but not to need to manually write MR tasks) could be a solution. What would be the benefits/disadvantages of such an architecture opposed to, for instance, Exasolution and their In-Memory - MPP - Columnar solution.
Are there other alternatives which might have some extra-benefits? What about maintenance and configuration? Any Microsoft solution (I may find customer specific needs regarding this)
Sorry for posting such an open question, but I would like to see some discussion so that I can learn from you as much as possible.
Though being an EXASOL guy, I will not start to try to convince you that EXASOL is the one and only good solution out there. It heavily depends on the use case you are trying to implement, and the requirements you have to fulfill.
Hadoop is a very flexible, scalable system and used very often for storing and processing huge volumes of data.
EXASOL in contrast is a specialized RDBMS for complex analytic query processing.
I think that these two options don't really directly compete but complement each other. In many cases companies need a scalable data lake to store and preprocess there data, or to query it in rather simply ways. Once you want to enter the real-time business with complex analytics, where dozens, hundreds or even thousands of analysts are running lots of queries, then an in-memory RDBMS is a great choice.
King, the producer of Candy Crush, combines these two worlds to a powerful data management eco system. They store petabytes of data within Hadoop and use EXASOL on top as an in-memory layer for hundreds of terabytes of data. You can read more about that exciting use case here: http://bit.ly/1TR8APY
Another important difference of these two worlds is the complexity. While EXASOL is tuning-free because it is a specialized system (similar to an appliance) for a certain use case running SQL queries or R/Python/Java in-database-analytics, the Hadoop stack is much more complex. You'll need a certain level of know how to setup, maintain and tune this system. This doesn't need to be a reason for any of the two option. As mentioned, it heavily depends on what you want.
From a price perspective, Hadoop is free and so it should be much cheaper than an in-memory db such as EXASOL, right? Wait a minute, it's not that easy. Again, you have to consider the whole picture. How much data you really want to store, how much of that needs to be queried for analysis, how much hardware would you need to buy, how many people do you have to be hired and trained for the operation or the analytics deployed on the system.
Summary
To summarize my thoughts, the world is too complicated to directly compare these two technologies. Depending on the use case and your personal requirements, either one or the other could be the better option. And in my opinion, the trend in the market is combining such systems to a data mgmt eco systems where you get the best out of the two worlds... Actually three worlds, because the world of operational data processing of NoSQL solutions should also be mentioned here.
I hope that helped a bit. If you need any further details especially about EXASOL, don't hesitate to contact me or connect with me on LinkedIn: de.linkedin.com/in/exagolo

Mixing column and row oriented databases?

I am currently trying to improve the performance of a web application. The goal of the application is to provide (real time) analytics. We have a database model that is similiar to a star schema, few fact tables and many dimensional tables. The database is running with Mysql and MyIsam engine.
The Fact table size can easily go into the upper millions and some dimension tables can also reach the millions.
Now the point is, select queries can get awfully slow if the dimension tables get joined on the fact tables and also aggretations are done. First thing that comes in mind when hearing this is, why not precalculate the data? This is not possible because the users are allowed to use several freely customizable filters.
So what I need is an all-in-one system suitable for every purpose ;) Sadly it wasn't invented yet. So I came to the idea to combine 2 existing systems. Mixing a row oriented and a column oriented database (e.g. like infinidb or infobright). Keeping the mysql MyIsam solution (for fast inserts and row based queries) and add a column oriented database (for fast aggregation operations on few columns) to it and fill it periodically (nightly) via cronjob. Problem would be when the current data (it must be real time) is queried, therefore I maybe would need to get data from both databases which can complicate things.
First tests with infinidb showed really good performance on aggregation of a few columns, so I really think this could help me speed up the application.
So the question is, is this a good idea? Has somebody maybe already done this? Maybe there is are better ways to do it.
I have no experience in column oriented databases yet and I'm also not sure how the schema of it should look like. First tests showed good performance on the same star schema like structure but also in a big table like structure.
I hope this question fits on SO.
Greenplum, which is a proprietary (but mostly free-as-in-beer) extension to PostgreSQL, supports both column-oriented and row-oriented tables with high customizable compression. Further, you can mix settings within the same table if you expect that some parts will experience heavy transactional load while others won't. E.g., you could have the most recent year be row-oriented and uncompressed, the prior year column-oriented and quicklz-compresed, and all historical years column-oriented and bz2-compressed.
Greenplum is free for use on individual servers, but if you need to scale out with its MPP features (which are its primary selling point) it does cost significant amounts of money, as they're targeting large enterprise customers.
(Disclaimer: I've dealt with Greenplum professionally, but only in the context of evaluating their software for purchase.)
As for the issue of how to set up the schema, it's hard to say much without knowing the particulars of your data, but in general having compressed column-oriented tables should make all of your intuitions about schema design go out the window.
In particular, normalization is almost never worth the effort, and you can sometimes get big gains in performance by denormalizing to borderline-comical levels of redundancy. If the data never hits disk in an uncompressed state, you might just not care that you're repeating each customer's name 40,000 times. Infobright's compression algorithms are designed specifically for this sort of application, and it's not uncommon at all to end up with 40-to-1 ratios between the logical and physical sizes of your tables.

Resources