We are struggling to model our data correctly for use in Kedro. We are using the recommended Raw\Int\Prm\Ft\Mst model but are struggling with some of the concepts, e.g.:
When is a dataset a feature rather than a primary dataset? The distinction seems vague...
Is it OK for a primary dataset to consume data from another primary dataset?
Is it good practice to build a feature dataset from the INT layer? or should it always pass through Primary?
I appreciate there are no hard & fast rules with data modelling, but these are big modelling decisions & any guidance or best practice on Kedro modelling would be really helpful. I can find just one table defining the layers in the Kedro docs.
If anyone can offer any further advice or blogs\docs talking about Kedro Data Modelling that would be awesome!
Great question. As you say, there are no hard and fast rules here and opinions do vary, but let me share my perspective as a QB data scientist and kedro maintainer who has used the layering convention you referred to several times.
For a start, let me emphasise that there's absolutely no reason to stick to the data engineering convention suggested by kedro if it's not suitable for your needs. 99% of users don't change the folder structure in data. This is not because the kedro default is the right structure for them but because they just don't think of changing it. You should absolutely add/remove/rename layers to suit yourself. The most important thing is to choose a set of layers (or even a non-layered structure) that works for your project rather than trying to shoehorn your datasets to fit the kedro default suggestion.
Now, assuming you are following kedro's suggested structure - onto your questions:
When is a dataset a feature rather than a primary dataset? The distinction seems vague...
In the case of simple features, a feature dataset can be very similar to a primary one. The distinction is maybe clearest if you think about more complex features, e.g. ones formed by aggregating over time windows. A primary dataset would have a column that gives a cleaned version of the original data, but without doing any complex calculations on it, just simple transformations. Say the raw data is the colour of all cars driving past your house over a week. By the time the data is in primary, it will be clean (e.g. correcting "rde" to "red", maybe mapping "crimson" and "red" to the same colour). Between primary and the feature layer, we will have done some less trivial calculations on it, e.g. finding the most common car colour each day and one-hot encoding it.
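To make the car-colour example concrete, here is a minimal sketch in plain Python. The sample rows, the `CORRECTIONS` mapping, the `COLOURS` vocabulary and both function names are assumptions made up for illustration; they are not part of Kedro.

```python
from collections import Counter

# Hypothetical raw observations: (day, colour as typed by the observer).
raw = [
    ("mon", "rde"), ("mon", "red"), ("mon", "blue"),
    ("tue", "crimson"), ("tue", "Blue"), ("tue", "blue"),
]

# Assumed cleaning rules: fix known typos and merge near-synonyms.
CORRECTIONS = {"rde": "red", "crimson": "red"}
# Assumed fixed vocabulary that defines the one-hot column order.
COLOURS = ["blue", "red"]


def to_primary(rows):
    """Primary layer: cleaned values, one row per observation, no aggregation."""
    cleaned = []
    for day, colour in rows:
        colour = colour.strip().lower()
        cleaned.append((day, CORRECTIONS.get(colour, colour)))
    return cleaned


def to_feature(primary_rows):
    """Feature layer: most common colour per day, one-hot encoded."""
    counts = {}
    for day, colour in primary_rows:
        counts.setdefault(day, Counter())[colour] += 1
    features = {}
    for day, counter in counts.items():
        top_colour = counter.most_common(1)[0][0]
        features[day] = [1 if c == top_colour else 0 for c in COLOURS]
    return features


primary = to_primary(raw)
features = to_feature(primary)
```

The primary step only touches individual values, while the feature step aggregates across rows, which is roughly where I would draw the line between the two layers.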
Is it OK for a primary dataset to consume data from another primary dataset?
In my opinion, yes. This might be necessary if you want to join multiple primary tables together. In general if you are building complex pipelines it will become very difficult if you don't allow this. e.g. in the feature layer I might want to form a dataset containing composite_feature = feature_1 * feature_2 from the two inputs feature_1 and feature_2. There's no way of doing this without having multiple sub-layers within the feature layer.
However, something that is generally worth avoiding is a node that consumes data from many different layers. e.g. a node that takes in one dataset from the feature layer and one from the intermediate layer. This seems a bit strange (why has the latter dataset not passed through the feature layer?).
Is it good practice to build a feature dataset from the INT layer? or should it always pass through Primary?
Building features from the intermediate layer isn't unheard of, but it seems a bit weird. The primary layer is typically an important one which forms the basis for all feature engineering. If your data is in a shape that you can build features then that means it's probably primary layer already. In this case, maybe you don't need an intermediate layer.
The above points might be summarised by the following rules (which should no doubt be broken when required):
The input datasets for a node in layer L should all be in the same layer, which can be either L or L-1
The output datasets for a node in layer L should all be in the same layer, which can be either L or L+1
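If you wanted to check these two rules automatically, a rough sketch could look like the following. The layer names are taken from the Raw\Int\Prm\Ft\Mst shorthand in the question, and `check_node` is a hypothetical helper, not something Kedro provides.

```python
# Layer order assumed from the Raw\Int\Prm\Ft\Mst convention in the question.
LAYERS = ["raw", "int", "prm", "ft", "mst"]


def check_node(node_layer, input_layers, output_layers):
    """Return a list of rule violations for one node (empty list means OK)."""
    level = LAYERS.index(node_layer)
    # Inputs may come from layer L or L-1, outputs may go to layer L or L+1.
    allowed_inputs = set(LAYERS[max(level - 1, 0): level + 1])
    allowed_outputs = set(LAYERS[level: level + 2])

    problems = []
    for kind, layers, allowed in (
        ("input", set(input_layers), allowed_inputs),
        ("output", set(output_layers), allowed_outputs),
    ):
        if len(layers) > 1:
            problems.append(f"{kind}s span several layers: {sorted(layers)}")
        if not layers <= allowed:
            problems.append(
                f"{kind}s outside {sorted(allowed)}: {sorted(layers - allowed)}"
            )
    return problems


# A feature node built from two primary tables passes.
ok = check_node("ft", ["prm", "prm"], ["ft"])
# A feature node mixing feature and intermediate inputs is flagged twice.
bad = check_node("ft", ["ft", "int"], ["ft"])
```

Since the rules are meant to be broken when required, something like this is better used as a warning than as a hard failure.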
If anyone can offer any further advice or blogs\docs talking about Kedro Data Modelling that would be awesome!
I'm also interested in seeing what others think here! One possibly useful thing to note is that kedro was inspired by cookiecutter data science, and the kedro layer structure is an extended version of what's suggested there. Maybe other projects have taken this directory structure and adapted it in different ways.
Your question prompted us to write a Medium article that explains these concepts better. It has just been published on Towards Data Science.
First post ever, so here we go! (Thanks for taking the time to read!)
I am currently studying in college and working on a research project on how different hardware (specifically a ram-disk vs hard drive) can affect the speed of big data analysis. I know how to set up the various hardware and all of that jazz, however, I have no previous experience with big data analysis, and after looking for a few days I have found no answers (even here). I need any software to be able to simulate big data analysis - I have read of Hadoop, but have no idea where to begin on that - and it seems that even with it there is no simulation. How would I go about getting software along with data to analyze? Specifically, something I could run as a control group and then again with the data stored on a ram-disk in order to see if there is a performance increase.
I really feel in over my head here and don't know where to start, so any help or tips are welcome. Thank you very much!
To clarify, I am hoping to begin on a very small-scale database, but I also have resources with my school to set up a very large drive to be able to test with.
There are many DB solutions out there in the market.
However, a big data DB must be designed to process this particular kind of data. The characteristics of big data are often summarized as the 3 Vs: volume, velocity, and variety.
Volume: big data means a large amount of data, terabytes (TB) or more. This is its most basic feature: a large amount of data already exists, and more is still being generated through multiple paths.
Velocity: large amounts of data must also be collected and analyzed in real time in accordance with the user’s needs. Variety: big data takes many forms. That is, it includes all types of data, such as structured, semi-structured and unstructured data. In addition to traditional data such as books, magazines, medical records, video and audio, it also includes data that carries location information.
The Machbase database is one big data tool you can try. Its website also offers a user manual and a getting-started page with instructions that are easy to follow. Good luck!
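For the control-group versus ram-disk comparison described in the question, a small timing harness may be enough to get started before any database is involved. The sketch below uses only the Python standard library; the file name, sizes and function names are assumptions chosen for illustration. One caveat to keep in mind: operating systems usually cache recently read files in memory, so repeated reads from the hard drive may be served from RAM unless the cache is cleared or the file is much larger than memory.

```python
import os
import tempfile
import time


def write_test_file(directory, size_mb=16):
    """Create a file of random bytes in `directory` and return its path."""
    path = os.path.join(directory, "benchmark.bin")
    chunk = os.urandom(1024 * 1024)  # 1 MB of random data, reused per write
    with open(path, "wb") as f:
        for _ in range(size_mb):
            f.write(chunk)
    return path


def time_scan(path, chunk_size=1024 * 1024):
    """Read the whole file once; return (seconds, bytes_read)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            total += len(block)
    return time.perf_counter() - start, total


def benchmark(directory, size_mb=16, repeats=3):
    """Write one test file, time several full scans, then clean up."""
    path = write_test_file(directory, size_mb)
    try:
        return [time_scan(path)[0] for _ in range(repeats)]
    finally:
        os.remove(path)


if __name__ == "__main__":
    # Replace the temporary directory with a hard-drive directory for the
    # control run and with the ram-disk mount point for the comparison run.
    with tempfile.TemporaryDirectory() as directory:
        print(benchmark(directory, size_mb=16))
```

Running `benchmark` once per storage location with the same `size_mb` gives two lists of timings that can be compared directly.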
As Wikipedia states
The overall goal of the data mining process is to extract information
from a data set and transform it into an understandable structure for
further use
How is this related with Big Data? Is it correct if I say that Hadoop is doing data mining in a parallel manner?
Big data is everything
Big data is a marketing term, not a technical term. Everything is big data these days. My USB stick is a "personal cloud" now, and my harddrive is big data. Seriously. This is a totally unspecific term that is largely defined by what the marketing departments of various very optimistic companies can sell - and the C*Os of major companies buy, in order to make magic happen. Update: and by now, the same applies to data science. It's just marketing.
Data mining is the old big data
Actually, data mining was just as overused... it could mean anything such as
collecting data (think NSA)
storing data
machine learning / AI (which predates the term data mining)
non-ML data mining (as in "knowledge discovery", where the term data mining was actually coined; but where the focus is on new knowledge, not on learning of existing knowledge)
business rules and analytics
visualization
anything involving data you want to sell for truckloads of money
It's just that marketing needed a new term. "Business intelligence", "business analytics", ... they still keep on selling the same stuff, it's just rebranded as "big data" now.
Most "big" data mining isn't big
Since most methods - at least those that give interesting results - just don't scale, most data "mined" isn't actually big. It's clearly much bigger than 10 years ago, but not big as in Exabytes. A survey by KDnuggets had something like 1-10 GB being the average "largest data set analyzed". That is not big data by any data management means; it's only large by what can be analyzed using complex methods. (I'm not talking about trivial algorithms such as k-means).
Most "big data" isn't data mining
Now "Big data" is real. Google has Big data, and CERN also has big data. Most others probably don't. Data starts being big, when you need 1000 computers just to store it.
Big data technologies such as Hadoop are also real. They aren't always used sensibly (don't bother to run Hadoop clusters with fewer than 100 nodes - at this point you can probably get much better performance from well-chosen non-clustered machines), but of course people write such software.
But most of what is being done isn't data mining. It's Extract, Transform, Load (ETL), so it is replacing data warehousing. Instead of using a database with structure, indexes and accelerated queries, the data is just dumped into Hadoop, and when you have figured out what to do, you re-read all your data and extract the information you really need, transform it, and load it into your Excel spreadsheet. Because after selection, extraction and transformation, usually it's not "big" anymore.
Data quality suffers with size
Many of the marketing promises of big data will not hold. Twitter produces much less insights for most companies than advertised (unless you are a teenie rockstar, that is); and the Twitter user base is heavily biased. Correcting for such a bias is hard, and needs highly experienced statisticians.
Bias from data is one problem - if you just collect some random data from the internet or an application, it will usually not be representative; in particular not of potential users. Instead, you will be overfitting to the existing heavy users if you don't manage to cancel out these effects.
The other big problem is just noise. You have spam bots, but also other tools (think Twitter "trending topics" that cause reinforcement of "trends") that make the data much noisier than other sources. Cleaning this data is hard, and not a matter of technology but of statistical domain expertise. For example Google Flu Trends was repeatedly found to be rather inaccurate. It worked in some of the earlier years (maybe because of overfitting?) but is no longer of good quality.
Unfortunately, a lot of big data users pay too little attention to this; which is probably one of the many reasons why most big data projects seem to fail (the others being incompetent management, inflated and unrealistic expectations, and lack of company culture and skilled people).
Hadoop != data mining
Now for the second part of your question. Hadoop doesn't do data mining. Hadoop manages data storage (via HDFS, a very primitive kind of distributed database) and it schedules computation tasks, allowing you to run the computation on the same machines that store the data. It does not do any complex analysis.
There are some tools that try to bring data mining to Hadoop. In particular, Apache Mahout can be called the official Apache attempt to do data mining on Hadoop. Except that it is mostly a machine learning tool (machine learning != data mining; data mining sometimes uses methods from machine learning). Some parts of Mahout (such as clustering) are far from advanced. The problem is that Hadoop is good for linear problems, but most data mining isn't linear. And non-linear algorithms don't just scale up to large data; you need to carefully develop linear-time approximations and live with losses in accuracy - losses that must be smaller than what you would lose by simply working on smaller data.
A good example of this trade-off problem is k-means. K-means actually is a (mostly) linear problem; so it can be somewhat run on Hadoop. A single iteration is linear, and if you had a good implementation, it would scale well to big data. However, the number of iterations until convergence also grows with data set size, and thus it isn't really linear. However, as this is a statistical method to find "means", the results actually do not improve much with data set size. So while you can run k-means on big data, it does not make a whole lot of sense - you could just take a sample of your data, run a highly-efficient single-node version of k-means, and the results will be just as good. Because the extra data just gives you some extra digits of precision of a value that you do not need to be that precise.
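The sampling argument can be tried out with a toy experiment. The sketch below is an illustration under stated assumptions, not a benchmark: the data is synthetic (two one-dimensional Gaussian clusters), k is fixed at 2, and `kmeans_1d` is a deliberately simple implementation of Lloyd's algorithm written for this example.

```python
import random


def kmeans_1d(values, iterations=20):
    """Lloyd's algorithm for k=2 in one dimension, seeded with min and max."""
    centres = [min(values), max(values)]
    for _ in range(iterations):
        groups = ([], [])
        for v in values:
            # Assign each value to the nearer of the two current centres.
            nearest = 0 if abs(v - centres[0]) <= abs(v - centres[1]) else 1
            groups[nearest].append(v)
        # Move each centre to the mean of the values assigned to it.
        centres = [sum(g) / len(g) for g in groups]
    return centres


rng = random.Random(0)
# Synthetic data (an assumption for illustration): clusters around 0 and 10.
full = [rng.gauss(0, 1) for _ in range(20000)]
full += [rng.gauss(10, 1) for _ in range(20000)]
# A 2.5% sample of the full data set.
sample = rng.sample(full, 1000)

full_centres = kmeans_1d(full)
sample_centres = kmeans_1d(sample)
print("full data:", full_centres)
print("sample:   ", sample_centres)
```

On data like this the two sets of centres should agree closely, which is the point made above: the extra data mostly buys extra digits of precision.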
Since this applies to quite a lot of problems, actual data mining on Hadoop doesn't seem to kick off. Everybody tries to do it, and a lot of companies sell this stuff. But it doesn't really work much better than the non-big version. But as long as customers want to buy this, companies will sell this functionality. And as long as it gets you a grant, researchers will write papers on this. Whether it works or not. That's life.
There are a few cases where these things work. Google search is an example, and CERN. But also image recognition (but not using Hadoop, clusters of GPUs seem to be the way to go there) has recently benefited from an increase in data size. But in any of these cases, you have rather clean data. Google indexes everything; CERN discards any non-interesting data, and only analyzes interesting measurements - there are no spammers feeding their spam into CERN... and in image analysis, you train on preselected relevant images, not on say webcams or random images from the internet (and if so, you treat them as random images, not as representative data).
What is the difference between big data and Hadoop?
A: The difference between big data and the open source software program Hadoop is a distinct and fundamental one. The former is an asset, often a complex and ambiguous one, while the latter is a program that accomplishes a set of goals and objectives for dealing with that asset.
Big data is simply the large sets of data that businesses and other parties put together to serve specific goals and operations. Big data can include many different kinds of data in many different kinds of formats. For example, businesses might put a lot of work into collecting thousands of pieces of data on purchases in currency formats, on customer identifiers like name or Social Security number, or on product information in the form of model numbers, sales numbers or inventory numbers. All of this, or any other large mass of information, can be called big data. As a rule, it’s raw and unsorted until it is put through various kinds of tools and handlers.
Hadoop is one of the tools designed to handle big data. Hadoop and other software products work to interpret or parse the results of big data searches through specific proprietary algorithms and methods. Hadoop is an open-source program under the Apache license that is maintained by a global community of users. It includes various main components, including a MapReduce set of functions and a Hadoop distributed file system (HDFS).
The idea behind MapReduce is that Hadoop first maps over a large data set, emitting key/value pairs, and then performs a reduction on the values grouped under each key to produce specific results. A reduce function can be thought of as a kind of aggregator that condenses the mapped data into a summary. HDFS, in turn, distributes the data across the machines of a cluster and replicates or moves it as necessary.
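As a rough illustration of the map and reduce steps, here is the classic word count written in plain Python. It runs in a single process and does not use Hadoop at all; the function names and input lines are made up for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


lines = ["big data is big", "hadoop handles big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

In a real cluster the map calls run on the machines that hold each piece of the input, and the grouping step moves data between machines before the reduce calls run.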
Database administrators, developers and others can use the various features of Hadoop to deal with big data in any number of ways. For example, Hadoop can be used to pursue data strategies like clustering and targeting with non-uniform data, or data that doesn't fit neatly into a traditional table or respond well to simple queries.
See the article posted at http://www.shareideaonline.com/cs/what-is-the-difference-between-big-data-and-hadoop/
Thanks
Ankush
This answer is really intended to add some specificity to the excellent answer from Anony-Mousse.
There's a lot of debate over exactly what Big Data is. Anony-Mousse called out a lot of the issues here around the overuse of terms like analytics, big data, and data mining, but there are a few things I want to provide more detail on.
Big Data
For practical purposes, the best definition I've heard of big data is data that is inconvenient or does not function in a traditional relational database. This could be data of 1PB that cannot be worked with or even just data that is 1GB but has 5,000 columns.
This is a loose and flexible definition. There are always going to be setups or data management tools which can work around it, but this is where tools like Hadoop, MongoDB, and others can be used more efficiently than prior technology.
What can we do with data that is this inconvenient/large/difficult to work with? It's difficult to simply look at a spreadsheet and to find meaning here, so we often use data mining and machine learning.
Data Mining
This was called out lightly above - my goal here is to be more specific and hopefully to provide more context. Data mining generally applies to somewhat supervised analytic or statistical methods for analysis of data. These may fit into regression, classification, clustering, or collaborative filtering. There's a lot of overlap with machine learning; however, this is still generally driven by a user rather than unsupervised or automated execution, which defines machine learning fairly well.
Machine Learning
Often, machine learning and data mining are used interchangeably. Machine learning encompasses a lot of the same areas as data mining but also includes AI, computer vision, and other unsupervised tasks. The primary difference, and this is definitely a simplification, is that user input is not only unnecessary but generally unwanted. The goal is for these algorithms or systems to self-optimize and to improve, rather than an iterative cycle of development.
Big Data is a term that covers a collection of frameworks and tools which can do miracles with very large data sets, including data mining.
Hadoop is a framework that splits very large data sets into blocks (64 MB by default) and stores them in HDFS (Hadoop Distributed File System). When its execution logic (MapReduce) arrives with bytecode to process the data stored in HDFS, it takes splits based on the blocks (the splits can be configured) and applies the extraction and computation via the Mapper and Reducer processes. In this way you can do ETL, data mining, data computation, etc.
To conclude, Big Data is a term for working with very large data sets. Hadoop is a framework that does parallel processing very well with its components and services. In that way you can do data mining with it too.
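To give a feel for the block splitting mentioned above, here is the arithmetic as a small sketch. It only models sizes and does not talk to HDFS; the 64 MB figure is the default quoted in this answer (it is configurable), and `split_into_blocks` is a hypothetical helper name.

```python
BLOCK_SIZE_MB = 64  # default quoted in the answer above; configurable in practice


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file would be split into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    # The last block only holds whatever is left over, if anything.
    return [block_size_mb] * full_blocks + ([remainder] if remainder else [])


blocks = split_into_blocks(200)
```

Roughly speaking, each block can then be processed by its own Mapper, which is where the parallelism comes from.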
Big Data is the term people use to say how storage is cheap and easy these days and how data is available to be analyzed.
Data Mining is the process of trying to extract useful information from data.
Usually, Data Mining is related to Big Data for 2 reasons
when you have lots of data, patterns are not so evident, so someone could not just inspect and say "hah". He/she needs tools for that.
in many cases, lots of data can improve the statistical significance of your analysis because your sample is bigger.
Can we say Hadoop is doing data mining in parallel? What is Hadoop? Their site says
The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers
using simple programming models
So the "parallel" part of your statement is true. The "data mining" part of it is not necessarily. You can just use hadoop to summarize tons of data and this is not necessarily data mining, for example. But for most cases, you can bet people are trying to extract useful info from big data using hadoop, so this is kind of a yes.
I would say that BigData is a modernized framework for addressing the new business needs.
As many people might know, BigData is all about the 3 Vs: volume, variety and velocity. BigData is the need to leverage a variety of data (structured and unstructured), to use clustering techniques to address the volume issue, and to get results in less time, i.e. velocity.
Data mining, by contrast, builds on the ETL principle, i.e. finding useful information in large datasets using modelling techniques. There are many BI tools available on the market to achieve this.
I'm trying to store 46,000 objects in Sterling and it's taking 3 minutes.
Yes I know it's a lot but this is data provided by the customer and could end up being a lot more.
I'm guessing each time I save a new object it is looking up the key to see if the object has already been stored.
Is there any way to bypass this and tell sterling to just insert?
Any other ideas?
Without knowing anything about your data structure it's a bit difficult to recommend ways in which you could improve performance, however:
The fewer indexes you create for your data tables the fewer indexes there are to create when your data is persisted. You should look carefully at which indexes you need for your data read scenarios.
The more data relationships there are, the more metadata there is to create at write time. You may be able to simplify the data structures and combine classes.
Sheer volume of data sounds like your biggest problem. I've experienced similar problems before with trying to persist large volumes of GPS data. The problem there is that I was trying to write a lot of relatively small amounts of data related to a single piece of data in another table. I managed to resolve this by consolidating the GPS data into a single string and persisting it as a field with the main record. This offloaded a lot of the read/write time into a significantly smaller amount of time for rehydrating the data when it was actually needed.
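The consolidation idea from the GPS example can be sketched as follows. This is plain Python rather than Sterling code, and the `lat,lon;lat,lon` text format and both function names are assumptions picked for illustration; any compact serialization would do.

```python
def pack_points(points):
    """Join (lat, lon) pairs into one 'lat,lon;lat,lon' string."""
    return ";".join(f"{lat!r},{lon!r}" for lat, lon in points)


def unpack_points(text):
    """Rebuild the list of (lat, lon) tuples from the packed string."""
    if not text:
        return []
    return [tuple(float(x) for x in pair.split(",")) for pair in text.split(";")]


track = [(51.5, -0.12), (48.85, 2.35)]
packed = pack_points(track)       # stored as a single field on the main record
restored = unpack_points(packed)  # rehydrated only when the points are needed
```

The database then writes one record with one string field instead of one row per GPS point, at the cost of not being able to query individual points.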
I would definitely recommend reaching out to Jeremy and the Sterling team via the CodePlex site if none of the above helps.
Have you considered keeping most data server-side, and presenting client with only a window into that data, something like 20 or 50 rows at a time?
EDIT: since the answer is no, I'd turn off the database table index while the operation is running, or use SQL bulk copy.