How to check data lineage on Azure Databricks and HDInsight?

I have notebooks that perform transformations on tables stored in DBFS (Databricks File System). I want to capture and display the data lineage. Additionally, I want to know how to do the same in HDInsight.

Spline is derived from the words Spark and Lineage. It is a tool used to visualize and track how data changes over time. Spline provides a GUI where the user can view and analyze how the data is transformed to give rise to the insights.
You may check out the article which explains Spark data lineage on a Databricks notebook using Spline, as well as the Data Lineage Tracking and Visualization Solution.
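For reference, a minimal sketch of what enabling the Spline Spark agent on a cluster might look like, assuming the agent JAR is attached; the configuration keys follow the open-source Spline agent and may differ by version, and the producer URL, input path, and table name are hypothetical placeholders:

```python
# Cluster Spark config (or spark-submit --conf), not notebook code; key names
# follow the open-source Spline agent (0.5+) and may differ by version:
#   spark.sql.queryExecutionListeners  za.co.absa.spline.harvester.listener.SplineQueryExecutionListener
#   spark.spline.producer.url          http://spline-server:8080/producer

# With the listener registered, any persistent write in a notebook is harvested:
df = (spark.read.option("header", True)
                .option("inferSchema", True)
                .csv("/mnt/raw/sales.csv"))          # hypothetical input
(df.groupBy("region")
   .sum("amount")
   .write.mode("overwrite")
   .saveAsTable("curated.sales_by_region"))          # lineage: read -> aggregate -> write
# The captured lineage graph can then be browsed in the Spline UI.
```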

Related

Multiple cells in a Databricks notebook

I am new to Databricks. My question is: why are there multiple cells in a notebook when we could write the whole set of instructions/program in one single cell?
Regards,
The advantage of using multiple cells is that you can break your big code into small portions (one in each cell) and execute each cell individually, without the need to execute the complete code, which may take a long time because of big analysis, large datasets, exploratory data analysis, transformation, etc.
In other words, since Databricks is a big data analysis tool, a typical workflow involves ingesting a large dataset (millions of rows), cleaning it, transforming it, and then applying data analysis and machine learning algorithms. All of these tasks require large compute resources if you run them in a single cell. Therefore, you can divide the above-mentioned tasks across cells in a Databricks notebook and run them individually.
E.g.: if you are ingesting data from an Azure Data Lake Storage (ADLS) account, you can create a mount point to the required storage resource and path in one cell and run that cell individually. Once your ADLS container is mounted, you can use another cell to prepare the data. This way, you don't need to mount the resource again, as it was already done in the previous cell.
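A minimal sketch of that two-cell pattern, assuming ADLS Gen2 with a service principal; the storage account, container, secret scope, and key names are hypothetical placeholders:

```python
# --- Cell 1: mount the storage once ---
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get("my-scope", "sp-client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("my-scope", "sp-client-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}
dbutils.fs.mount(
    source="abfss://mycontainer@mystorageaccount.dfs.core.windows.net/",
    mount_point="/mnt/mydata",
    extra_configs=configs,
)

# --- Cell 2: prepare the data, reusing the mount created above ---
df = spark.read.option("header", True).csv("/mnt/mydata/raw/events.csv")
clean = df.dropna().dropDuplicates()
clean.write.mode("overwrite").parquet("/mnt/mydata/curated/events")
```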

What is the opposite of ETL?

ETL (extract, transform, load) is the process of getting data into a data warehouse from various sources.
Is there a name for the opposite process? Extracting data from a data warehouse, transforming it and putting it into a table - usually to feed a reporting tool.
Technically speaking, the opposite of an ETL is an ELT.
Instead of extract, transform, then load, an ELT is extract, load, then transform. The choice between the two pipelines depends on the system and the nature of the data. For example, the process of bringing data into a relational database necessarily requires a transformation before loading, but other frameworks, such as Hadoop, are better able to handle unstructured data and apply structure to it after loading takes place.
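A rough PySpark sketch contrasting the two orderings; the table and path names are made up for illustration, and the ELT step assumes a warehouse-side SQL engine (e.g. Delta / Databricks SQL):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# ETL: extract, transform in the pipeline, then load the cleaned result
raw = spark.read.json("/data/source/orders")                        # extract
cleaned = (raw.filter(F.col("amount") > 0)                          # transform
              .withColumn("order_date", F.to_date("created_at")))
cleaned.write.mode("overwrite").saveAsTable("dw.orders")            # load

# ELT: extract and load the raw data first, then transform inside the warehouse
raw.write.mode("overwrite").saveAsTable("staging.orders_raw")       # extract + load
spark.sql("""
    CREATE OR REPLACE TABLE dw.orders_elt AS   -- transform in place (assumes Delta)
    SELECT *, to_date(created_at) AS order_date
    FROM staging.orders_raw
    WHERE amount > 0
""")
```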
Since this question was asked, 6 years ago (!), a lot has changed in the ETL landscape.
There is a new trend called "Reverse ETL", which is the idea of taking cleaned/transformed/modeled data from your warehouse back into the SaaS applications (Salesforce, Marketo, Zendesk, HubSpot, etc.) that your teams use.
The main tools are
getCensus
Seekwell
Grouparoo
You can read more about this nascent trend here and here too
The ETL abbreviation applies to any extract, transform and load sequence. It can be applied to extracting data from a data warehouse, transforming the data and loading the transformed data into a table.
In your question you have two ETL sequences; one that loads the data into the data warehouse and one that extracts information from the data warehouse and loads this data into the table.

Best technology stack for aggregation across various properties

We are working on developing a platform which models the flow of entities across a graph. The system has to answer questions of the kind: how many entities having these properties are sitting at a given node on the graph, what is the inflow on a node, what is the outflow on a node, etc. Flow data is fed to the system in a stream. We are thinking of breaking the flow data into time buckets (say 5 mins), pre-computing various aggregates against different properties, and storing the aggregates in DynamoDB to serve queries.
With regards to this we are evaluating the following options:
EMR: put the flow data in AWS S3/DynamoDB and run a MapReduce/Hive job
RDS: put recent data into AWS RDS and compute the aggregates via SQL
Akka: a framework for building distributed applications via actors and message passing
If anyone has worked on a similar use case or has used any of the above technologies, please let me know which approach would be the best fit for our use case.
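For illustration, a minimal PySpark sketch of the 5-minute bucketing and pre-aggregation described in the question; the column names, source path, and downstream DynamoDB loading are assumptions, not part of the original design:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

flows = (spark.read.json("s3://my-bucket/flow-events/")        # placeholder event source
              .withColumn("event_time", F.to_timestamp("event_time")))

agg = (
    flows
    # Bucket each event into a 5-minute window, then aggregate per node/property
    .groupBy(F.window("event_time", "5 minutes"), "node_id", "entity_type")
    .agg(
        F.sum(F.when(F.col("direction") == "in", 1).otherwise(0)).alias("inflow"),
        F.sum(F.when(F.col("direction") == "out", 1).otherwise(0)).alias("outflow"),
        F.count("*").alias("events"),
    )
)

# The pre-computed aggregates could then be written out and pushed into DynamoDB
# (e.g. with boto3's batch_writer), keyed by (node_id, window start).
agg.write.mode("overwrite").parquet("s3://my-bucket/flow-aggregates/")
```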
I have used EMR to process data in S3... it works pretty well. And the best part is that you can spin up Hadoop clusters of various sizes that fit the workload.
You may want to look into Storm for stream processing.
I am also collecting a list of big-data tools here: http://hadoopilluminated.com/hadoop_book/Bigdata_Ecosystem.html
The final solution employed AWS Redshift; the driving reason was the requirement for high-speed data ingestion, which Redshift provides via the COPY command.
Hadoop is built to store data efficiently; however, it does not guarantee a sub-second SLA for ingestion, nor does it provide an SLA for when the data will be available for MR jobs. This was the main reason we did not go with EMR or Hadoop in general.
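A hedged sketch of that ingestion path: staging files in S3 and loading them into Redshift with COPY, here issued from Python via psycopg2; the cluster endpoint, table, and IAM role are placeholders:

```python
import psycopg2  # standard PostgreSQL driver, which Redshift speaks

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    port=5439,
    dbname="analytics",
    user="loader",
    password="...",
)

copy_sql = """
    COPY flow_aggregates
    FROM 's3://my-bucket/flow-aggregates/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    FORMAT AS PARQUET;
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)   # Redshift pulls the S3 files in parallel across slices
```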

Using Hadoop & related projects to analyze usage patterns that constantly change

We're strategizing on how to analyze user "interest" (clicks, likes, etc) on 1M+ items on our site to generate a "similar items" list.
In order to process a large amount of raw data we're learning about Hadoop, Hive, and related projects.
My question is regarding this concern: Hadoop/Hive and the like seem to be geared more towards data dumps followed by processing cycles. Presumably the end of the processing cycle is something to the extent of an indexed graph of links between related items.
If I'm on track so far, how is data typically processed in these scenarios? I.e.:
Is the raw user data re-analyzed at intervals to re-build an indexed graph of links?
Do we stream data as it comes in, analyze it and update the data store?
As the resultant data from the analysis changes, are we typically updating it piece by piece, or re-processing in bulk?
Is this use case better addressed by Cassandra than Hive/HDFS?
I'm looking to better understand the common approach to this kind of big data processing.
I think this is a good use case for the Hadoop family of tools.
It looks to me like HDFS and Flume might be obvious choices. I would look into either HBase or Hive depending on what kinds of analysis you are interested in and how flexible you are in organizing and querying the data.
Is the raw user data re-analyzed at intervals to re-build an indexed graph of links?
Answer: Hadoop is very good for this. I would use HBase for this, but there are other choices.
Do we stream data as it comes in, analyze it and update the data store?
Answer: Flume is good for this.
As the resultant data from the analysis changes, are we typically updating it piece by piece, or re-processing in bulk?
Answer: You have options to do both. Bulk would probably be a MapReduce job on HDFS where piece-by-piece could be managed through HBase column-family values or Hive rows. If you give more details, I could be more precise.
Is this use case better addressed by Cassandra than Hive/HDFS?
Answer: Cassandra and HBase are both implementations of Google's BigTable. I think that choice depends on how you need to organize, access, analyze and update the data. I can provide more guidance if needed.
HBase is usually better for semi-structured, high R/W processing.
HDFS is generally a good choice for flexible, scalable storage of data dumps, as you call them.
Flume is applicable for moving streaming data.
I would also consider looking into Titan and HBase if you are thinking graph.
Hive would be applicable if you are interested in tabular-oriented data and using SQL-like queries.
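As a hedged illustration of the piece-by-piece update path mentioned above, here is what writing refreshed similar-item lists into HBase might look like from Python using the happybase client; the table, column family, and row schema are assumptions for the sketch:

```python
import json
import happybase  # thin Python client over the HBase Thrift gateway

connection = happybase.Connection("hbase-thrift-host")   # placeholder host
table = connection.table("similar_items")                # assumes column family 'sim' exists

def update_similar_items(item_id, neighbours):
    """Overwrite the similarity list for a single item, without a bulk re-process."""
    table.put(
        item_id.encode("utf-8"),
        {b"sim:neighbours": json.dumps(neighbours).encode("utf-8")},
    )

update_similar_items("item-123", [{"id": "item-456", "score": 0.87}])
```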

Assessing and comparing Hadoop for Business Intelligence Design considerations

I am considering various technologies for data warehousing and business intelligence, and have come upon this radical tool called Hadoop. Hadoop doesn't seem to be exactly built for BI purposes, but there are references of it having potential in this field. ( http://www.infoworld.com/d/data-explosion/hadoop-pitched-business-intelligence-488).
However little information I have gathered from the internet, my gut tells me that Hadoop can become a disruptive technology in the space of traditional BI solutions. There really is sparse information regarding this topic, and hence I wanted to gather all the gurus' thoughts here on the potential of Hadoop as a BI tool as compared to traditional backend BI infrastructure like Oracle Exadata, Vertica, etc. For starters, I would like to ask the following question -
Design considerations - How would designing a BI solution with Hadoop be different from traditional tools? I know it should be different, as I read that one cannot create schemas in Hadoop. I also read that a major advantage will be the complete elimination of ETL tools with Hadoop (is this true?). Do we need Hadoop + Pig + Mahout to get a BI solution?
Thanks & Regards!
Edit - Breaking this down into multiple questions. I will start with the one I think is most important.
Hadoop is a great tool to be part of a BI solution. It is not, itself, a BI solution. What Hadoop does is take in Data_A and output Data_B. Whatever is needed for BI but is not in a useful form can be processed using MapReduce into a useful form of the data, be it CSV, Hive, HBase, MSSQL or anything else used to view the data.
I believe Hadoop is supposed to be the ETL tool. That's what we are using it for. We process gigs of log files every hour, store them in Hive, and do daily aggregations that are loaded into an MSSQL server and viewed through a visualization layer.
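A minimal sketch of that kind of hourly log-to-Hive ETL plus daily aggregation, assuming PySpark with Hive support; the paths, regexes, and table names are placeholders rather than the actual job described:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hourly: parse the raw logs and append them to a date-partitioned Hive table
logs = spark.read.text("hdfs:///raw/weblogs/2024-01-01/03/")        # placeholder path
parsed = logs.select(
    F.regexp_extract("value", r"^(\S+)", 1).alias("client_ip"),
    F.regexp_extract("value", r'"(?:GET|POST) (\S+)', 1).alias("path"),
    F.lit("2024-01-01").alias("dt"),
)
parsed.write.mode("append").partitionBy("dt").saveAsTable("web.requests")

# Daily: aggregate the day's partition; the result would then be exported to the
# reporting database (MSSQL in the answer above) for the visualization layer.
daily = (
    spark.table("web.requests")
    .where(F.col("dt") == "2024-01-01")
    .groupBy("path")
    .agg(F.count("*").alias("hits"),
         F.countDistinct("client_ip").alias("unique_visitors"))
)
```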
The major design considerations I've run against are:
- Data flexibility: Do you want your users to view pre-aggregated data, or to have the flexibility to adjust the query and look at the data however they want?
- Speed: How long do you want your users to wait for the data? Hive (for example) is slow. It takes minutes to generate results, even on fairly small data sets. The larger the data traversed, the longer it will take to generate a result.
- Visualization: What type of visualization do you want to use? Do you want to custom build a lot of pieces or be able to use something off the shelf? What constraints and flexibility are needed for your visualization? How flexible and changeable does the visualization need to be?
hth
Update: As a response to #Bhat's comment asking about the lack of visualization...
The lack of a visualization tool that would allow us to effectively utilize the data stored in HBase was a major factor in re-evaluating our solution. We stored the raw data in Hive, and pre-aggregated the data and stored it in HBase. To utilize this we were going to have to write a custom connector (we did this part) and a visualization layer. We looked at what we would be able to produce and what is commercially available, and went the commercial route.
We still use Hadoop as our ETL tool for processing our weblogs, it's fantastic for that. We just send the ETL'd raw data to a commercial big data database that will take the place of both Hive and HBase in our design.
Hadoop doesn't really compare to MSSQL or other data warehouse storage. Hadoop doesn't do any storage (ignoring HDFS); it does processing of data. Running MapReduce jobs (which Hive does) is going to be slower than MSSQL (or the like).
Hadoop is very well suited for storing colossal files that can represent fact tables. These tables can be partitioned by placing the individual files representing the table into separate directories. Hive understands such file structures and allows you to query them as partitioned tables. You can phrase your BI questions against the Hadoop data in the form of SQL queries via Hive, but you will still need to write and run the occasional MapReduce job.
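A brief illustration of that directory-per-partition layout, using Spark's Hive support; the database, columns, and location are made up:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Files live under .../fact_sales/dt=2024-01-01/, .../fact_sales/dt=2024-01-02/, etc.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS dw.fact_sales (
        order_id BIGINT,
        amount   DOUBLE
    )
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET
    LOCATION 'hdfs:///warehouse/fact_sales'
""")
# Register the partition directories that already exist on the file system
spark.sql("MSCK REPAIR TABLE dw.fact_sales")

# A BI-style question phrased as SQL, pruned to the relevant partitions
spark.sql("""
    SELECT dt, SUM(amount) AS revenue
    FROM dw.fact_sales
    WHERE dt >= '2024-01-01'
    GROUP BY dt
""").show()
```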
From a business perspective, you should consider Hadoop if you have a lot of low-value data. There are many cases where RDBMS/MPP solutions are not cost-effective.
You should also consider Hadoop as a serious option if your data is not structured (HTML, for example).
We are creating a comparison matrix of BI tools for Big Data / Hadoop:
http://hadoopilluminated.com/hadoop_book/BI_Tools_For_Hadoop.html
It is a work in progress and we would love any input.
(Disclaimer: I am the author of this online book.)
