Is there a best practice guide and annotations for a data lineage diagram?

I am looking to create a data lineage diagram showing the source and movement of some of our data across different systems and processes, but I have found that no two data lineage diagrams look the same. I just wanted to know if there is a best practice out there? There also seems to be a lack of information on the topic, so maybe it goes by a more popular name?
Thanks

Related

Kedro Data Modelling

We are struggling to model our data correctly for use in Kedro - we are using the recommended Raw\Int\Prm\Ft\Mst model but are struggling with some of the concepts, e.g.:
When is a dataset a feature rather than a primary dataset? The distinction seems vague...
Is it OK for a primary dataset to consume data from another primary dataset?
Is it good practice to build a feature dataset from the INT layer? Or should it always pass through Primary?
I appreciate there are no hard & fast rules with data modelling, but these are big modelling decisions, and any guidance or best practice on Kedro modelling would be really helpful; I can find just one table defining the layers in the Kedro docs.
If anyone can offer any further advice or blogs\docs talking about Kedro Data Modelling that would be awesome!
Great question. As you say, there are no hard and fast rules here and opinions do vary, but let me share my perspective as a QB data scientist and kedro maintainer who has used the layering convention you referred to several times.
For a start, let me emphasise that there's absolutely no reason to stick to the data engineering convention suggested by kedro if it's not suitable for your needs. 99% of users don't change the folder structure in the data directory - not because the kedro default is the right structure for them, but because they just don't think of changing it. You should absolutely add/remove/rename layers to suit yourself. The most important thing is to choose a set of layers (or even a non-layered structure) that works for your project rather than trying to shoehorn your datasets into the kedro default suggestion.
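For reference, the data folder that kedro generates by default has one directory per layer, roughly as below; a smaller project might, for example, collapse this down to just raw / clean / features / models. The point is that the names should reflect your project, not the template.

    data/
        01_raw/
        02_intermediate/
        03_primary/
        04_feature/
        05_model_input/
        06_models/
        07_model_output/
        08_reporting/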
Now, assuming you are following kedro's suggested structure - onto your questions:
When is a dataset a feature rather than a primary dataset? The distinction seems vague...
In the case of simple features, a feature dataset can be very similar to a primary one. The distinction is maybe clearest if you think about more complex features, e.g. ones formed by aggregating over time windows. A primary dataset would have a column that gives a cleaned version of the original data, but without doing any complex calculations on it, just simple transformations. Say the raw data is the colour of all cars driving past your house over a week. By the time the data is in primary, it will be clean (e.g. correcting "rde" to "red", maybe mapping "crimson" and "red" to the same colour). Between primary and the feature layer, we will have done some less trivial calculations on it, e.g. finding the one-hot encoded most common car colour for each day.
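To make that concrete, here is a minimal pandas sketch of the car-colour example (the column names and cleaning map are made up purely for illustration):

    import pandas as pd

    raw = pd.DataFrame({
        "timestamp": pd.to_datetime(
            ["2021-01-01 08:00", "2021-01-01 09:30", "2021-01-02 10:15"]
        ),
        "colour": ["rde", "crimson", "blue"],
    })

    # Raw -> primary: simple cleaning only (typo fixes, synonym mapping).
    primary = raw.assign(colour=raw["colour"].replace({"rde": "red", "crimson": "red"}))

    # Primary -> feature: a less trivial aggregation - the most common
    # colour per day, one-hot encoded.
    most_common = (
        primary.groupby(primary["timestamp"].dt.date)["colour"]
        .agg(lambda s: s.mode().iloc[0])
    )
    feature = pd.get_dummies(most_common, prefix="most_common_colour")
    print(feature)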
Is it OK for a primary dataset to consume data from another primary dataset?
In my opinion, yes. This might be necessary if you want to join multiple primary tables together. In general if you are building complex pipelines it will become very difficult if you don't allow this. e.g. in the feature layer I might want to form a dataset containing composite_feature = feature_1 * feature_2 from the two inputs feature_1 and feature_2. There's no way of doing this without having multiple sub-layers within the feature layer.
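As a rough sketch of what such sub-layers can look like as a kedro pipeline (the dataset and function names here are invented, and this assumes a reasonably recent kedro version that exposes node and pipeline from kedro.pipeline):

    from kedro.pipeline import node, pipeline


    def make_feature_1(primary_table):
        ...


    def make_feature_2(primary_table):
        ...


    def combine(feature_1, feature_2):
        return feature_1 * feature_2  # composite_feature


    feature_pipeline = pipeline(
        [
            node(make_feature_1, inputs="primary_table", outputs="feature_1"),
            node(make_feature_2, inputs="primary_table", outputs="feature_2"),
            # This node consumes two feature-layer datasets, i.e. a sub-layer
            # within the feature layer.
            node(combine, inputs=["feature_1", "feature_2"], outputs="composite_feature"),
        ]
    )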
However, something that is generally worth avoiding is a node that consumes data from many different layers. e.g. a node that takes in one dataset from the feature layer and one from the intermediate layer. This seems a bit strange (why has the latter dataset not passed through the feature layer?).
Is it good practice to build a feature dataset from the INT layer? or should it always pass through Primary?
Building features from the intermediate layer isn't unheard of, but it seems a bit weird. The primary layer is typically an important one which forms the basis for all feature engineering. If your data is already in a shape that you can build features from, then it's probably primary-layer data. In that case, maybe you don't need an intermediate layer.
The above points might be summarised by the following rules (which should no doubt be broken when required):
The input datasets for a node in layer L should all be in the same layer, which can be either L or L-1
The output datasets for a node in layer L should all be in the same layer, which can be either L or L+1
If anyone can offer any further advice or blogs\docs talking about Kedro Data Modelling that would be awesome!
I'm also interested in seeing what others think here! One possibly useful thing to note is that kedro was inspired by cookiecutter data science, and the kedro layer structure is an extended version of what's suggested there. Maybe other projects have taken this directory structure and adapted it in different ways.
Your question prompted us to write a Medium article better explaining these concepts; it's just been published on Towards Data Science.

Identify data warehouse design methodologies in the following diagram

Can someone help me identify the top-down, bottom-up, and hybrid data warehouse design methodologies as mentioned here in Wikipedia in the following diagram? I am interested in understanding how the diagram differs depending on each design methodology.
The diagram is too generic to enable identification of a methodology. Further, the Wikipedia article is surprisingly out of date.
There are four mainstream DW methodologies in common use today - Dimensional (Kimball), 3NF (Inmon), Data Vault (Linstedt) and Anchor Modelling (Ronnback). All could be represented within that diagram.
The issue of top-down or bottom-up in this article is centred around data marts. There is no requirement that marts are stored in a separate database, or even in a DBMS. In the context of your diagram they might exist in either the data warehouse or the analysis tool. In any case, the diagram does not give any indication of what came first, so you can't infer an approach.
In order to identify the methodology (Kimball, etc.) that was used to design the warehouse you'd need to see its data model. It would be immediately apparent from the model.
To identify the order in which components were delivered you'd need to see some sort of timeline, project plan, etc.

Cluster analysis in a Hadoop / MapReduce environment

We are currently trying to create some very basic personas based on our user database (a few million profiles). The goal at this stage is to find out what the characteristics of our users are, for example what they look like and what they are looking for, and to create several "typical" user profiles.
I believe the best way to achieve this would be to run a cluster analysis in order to find similarities among users.
The big roadblock, however, is how to get there. We are tracking our data in a Hadoop environment and I am being told that this could potentially be achieved with our tools.
I have familiarised myself with the theory of the topic and know that it can be done, for example, in SPSS (quite hard to use and limited to samples of large data sets).
The big question: is it possible to perform one or more types of cluster analysis in a Hadoop environment and then visualise the results as in SPSS? It is my understanding that we would need to run several types of analysis in order to find the best way to cluster the data, also when it comes to distance measurements of the clusters.
I have not found any information on the internet with regards to this, so I wonder if this is possible at all without a major programming effort (meaning literally implementing, for example, all the standard tools available in SPSS: dendrograms, the different result tables, cluster graphs etc.).
Any input would be much appreciated. Thanks.
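One possible route, assuming Spark (and its MLlib library) can run against the Hadoop cluster, is to do the clustering at scale there and pull the much smaller per-cluster summaries back into a desktop tool for the SPSS-style visualisations. A very rough PySpark sketch, with an invented input path and made-up profile columns:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.clustering import KMeans

    spark = SparkSession.builder.appName("user-personas").getOrCreate()

    # Hypothetical location of the profile data on HDFS.
    profiles = spark.read.parquet("hdfs:///data/user_profiles")

    # Assemble the numeric characteristics to cluster on into one vector column.
    assembler = VectorAssembler(
        inputCols=["age", "sessions_per_week", "avg_basket_value"],  # made-up columns
        outputCol="features",
    )
    features = assembler.transform(profiles)

    # Fit k-means; in practice you would try several values of k and several
    # distance/feature choices and compare the results.
    model = KMeans(k=5, seed=42, featuresCol="features").fit(features)
    clustered = model.transform(features)  # adds a "prediction" column
    clustered.groupBy("prediction").count().show()

The per-cluster counts and centroids are small enough to export and plot with whatever charting tool you already have; dendrograms would need a hierarchical method rather than k-means.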

How to create training data

Can anybody tell me how to create training data for categorization? I am using OpenNLP for categorization. Is there any tool to create training data, or if I have to create it manually, how should it be done? I am a complete noob in this field. Please help.
Well, normally you have some kind of historical data of previous (manual) categorization. Otherwise you would have to create the data that you need somehow; such data is often created by observation.
It heavily depends, though, on the data you are trying to categorize.
If you were able to generate training data automatically, you would already have a perfect algorithm for the data and would not need to train a system, would you?
If it is not possible to have training data, you might have to look at algorithms which don't need to learn upfront, i.e. which learn as data comes in while someone constantly corrects the system's mistakes.
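If it helps with the practical side: OpenNLP's document categorizer is trained from a plain text file with one sample per line, the category label first and the document text after it, separated by whitespace. A small Python sketch that writes manually labelled examples into that format (the labels and sentences are just placeholders):

    # One sample per line: category label, then the document text.
    labelled = [
        ("sports", "the team won the final after extra time"),
        ("politics", "parliament passed the new budget bill"),
    ]

    with open("train.txt", "w", encoding="utf-8") as f:
        for category, text in labelled:
            f.write(f"{category} {text}\n")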

Mahout's synthetic control data example

Mahout's wiki includes an example of using clustering on synthetic control data (here).
The example includes a data sample with 100 rows of data for each of 6 patterns in the data. What I expect when I run the example code is that some of the clustering methods would provide better or worse clustering, but that they would more or less provide clusters grouping the 6 patterns.
That's not at all what I'm seeing when I run the examples. As a beginner, this is very confusing. Furthermore, since the data isn't normalized and the periods of the cyclic data don't match up, it's very hard to see how this raw data could ever cluster properly.
Am I missing something? Can a more experienced Mahout-er provide some orientation to what one should expect in this particular example?
I'm very interested in the scenario in which patterns in time-series data can be clustered. I have tried normalizing the data and using point-to-point deltas as the basis for clustering, and got slightly better results. Does a more experienced data analyst have suggestions for a better approach?
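For what it's worth, the preprocessing described above (normalising each series, then clustering on the point-to-point deltas so that shape matters more than level) is easy to prototype outside Mahout first. An illustrative numpy sketch with random stand-in data:

    import numpy as np

    # 600 stand-in series of 60 points each (the real synthetic control set
    # has 600 rows: 100 per pattern, 6 patterns).
    series = np.random.rand(600, 60)

    # z-normalise each row so differences in scale don't dominate the distance.
    normalised = (series - series.mean(axis=1, keepdims=True)) / series.std(
        axis=1, keepdims=True
    )

    # Point-to-point deltas capture the shape of each series rather than its level.
    deltas = np.diff(normalised, axis=1)
    # 'deltas' (shape 600 x 59) would then be the input to k-means or similar.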
