Multiple cells in databricks notebook - azure-databricks

I am new to Databricks. My question is: why are there multiple cells in a notebook, when we could write the whole set of instructions/program in one single cell?
Regards,

The advantage of using multiple cells is that you can break your big program into small portions (one in each cell) and execute each cell individually, without having to re-run the complete code, which may take a long time because of big analysis jobs, large datasets, exploratory data analysis, transformations, etc.
In other words, Databricks is a big data analysis tool, and a typical workload involves ingesting a large dataset (millions of rows), cleaning it, transforming it, and then running data analysis and machine learning algorithms on it. All of these tasks require a lot of compute if you run them in a single cell. Therefore, you can divide the tasks above across cells in a Databricks notebook and run them individually.
E.g.: if you are ingesting data from an Azure Data Lake Storage (ADLS) account, you can create a mount point to the required storage resource and path in one cell and run that cell on its own. Once your ADLS container is mounted, you can use another cell to prepare the data. This way you don't need to mount the resource again, as that was already done in the previous cell.
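For illustration, a minimal sketch of that flow in Scala on Databricks (the storage account, container, mount point, paths and cleaning steps below are placeholders, not taken from any real setup; the extraConfigs map is left for your own credentials):
// Cell 1: mount the ADLS container once (fill in your own storage account,
// container name and OAuth/access-key settings via extraConfigs).
dbutils.fs.mount(
  source = "abfss://my-container@mystorageaccount.dfs.core.windows.net/",
  mountPoint = "/mnt/raw-data",
  extraConfigs = Map[String, String]( /* credentials for your account go here */ )
)

// Cell 2: the mount created in Cell 1 persists, so this cell can simply read from it.
val rawDf = spark.read.option("header", "true").csv("/mnt/raw-data/sales/")

// Cell 3: data preparation can be run (and re-run) on its own, without re-mounting
// or re-reading anything done in the earlier cells.
val cleanedDf = rawDf.na.drop().dropDuplicates()
display(cleanedDf)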

Related

How to check data lineage on azure databricks and HDinsight?

I have notebooks that perform transformations on tables stored in DBFS (Databricks File System). I want to capture and display the data lineage. Additionally, I want to know how to do the same in HDInsight.
Spline is derived from the words Spark and Lineage. It is a tool used to visualize and track how data changes over time. Spline provides a GUI where the user can view and analyze how the data is transformed to produce the resulting insights.
You may check out the articles Spark Data Lineage on Databricks Notebook using Spline and Data Lineage Tracking And Visualization Solution.
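As a rough sketch of what the Spline route can look like in a notebook (this assumes the Spline Spark agent library is attached to the cluster and that spark.spline.producer.url in the cluster's Spark config points at a running Spline server; the table names are made up):
// Programmatic initialization of the Spline agent (codeless init via a
// query execution listener is the alternative).
import za.co.absa.spline.harvester.SparkLineageInitializer._

// Enable lineage capture for every write action run through this session.
spark.enableLineageTracking()

// Any subsequent write is reported to the Spline server, where the resulting
// lineage graph can be browsed in the Spline UI.
spark.read.table("raw_events")
  .groupBy("user_id").count()
  .write.mode("overwrite").saveAsTable("events_per_user")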

Performance optimization of DataFrame based application

I'm writing an application, which produces several files storing them back to S3.
Most of the transformations operate on DataFrames. The current state of the application is already somewhat complex being translated into 60 jobs, some of them mapped to hundreds of stages. Some of the DataFrames are reused along the way and those are cached.
The problem is performance, which is clearly impacted by dependencies.
I have a few questions, any input on any of them will be highly appreciated.
(1) When I split the application into parts, execute them individually reading the inputs from generated files by the previous ones, the total execution time is a fraction (15%) of the execution time of the application run as a whole. This is counterintuitive as the whole application reuses DataFrames already in memory, caching guarantees that no DataFrame is computed more than once and various jobs are executing in parallel wherever possible.
I also noticed that the total number of stages in the latter case is much higher than the first one and I would think they should be comparable. Is there an explanation for this?
(2) If the approach of executing parts of the application individually is the way to go, then how do I enforce the dependencies between the parts to make sure the necessary inputs are ready?
(3) I read a few books, which devote some chapters to the execution model and performance analysis through the Spark Web UI. All of them use RDDs and I need DataFrames. Obviously even for DataFrame based applications Spark Web UI provides a lot of useful information but the analysis is much harder. Is there a good resource I could use to help me out?
(4) Is there a good example demonstrating how to minimize shuffling by appropriate partitioning of the DataFrame? My attempts so far have been ineffective.
Thanks.
1. Splitting the application is not recommended. If you have many stages and are having performance issues, try checkpointing, which saves an RDD to a reliable storage system (e.g. HDFS, S3) while forgetting the RDD's lineage completely.
// set the checkpoint directory once per application (sc is the SparkContext)
sc.setCheckpointDir("/tmp/checkpoint-dir") // any reliable storage path (HDFS, S3, ...)
// checkpoint the RDD
rdd.checkpoint()
If you are using DataFrames, then manually checkpoint the data at logical points by introducing Parquet/ORC hops (writing the data to, and reading it back from, Parquet/ORC files):
// Write to ORC (/tmp/src-dir/ is an HDFS directory)
dataframe.write.format("orc").save("/tmp/src-dir/tempFile.orc")
// Read ORC back
val orcRead = sqlContext.read.format("orc").load("/tmp/src-dir/tempFile.orc")
2. Splitting the program is not recommended, but if you still want to do it, create separate ".scala" programs and manage the dependencies between them at the Oozie level.
3. In the Spark web UI, refer to the SQL tab, which will give you the execution plan. For a detailed study of a DataFrame, run
DF.explain() //which will show you the execution plan
DataFrames in Spark have their execution automatically optimized by a query optimizer. Before any computation on a DataFrame starts, the Catalyst optimizer compiles the operations that were used to build the DataFrame into a physical plan for execution. Because the optimizer understands the semantics of operations and structure of the data, it can make intelligent decisions to speed up computation.
Refer Spark guide - http://spark.apache.org/docs/latest/sql-programming-guide.html
4. Sort the data before operations such as joins. To reduce shuffling, use the repartition function.
val DF1 = DF.repartition(10)
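For point 4, a sketch of what this can look like (the paths, table and column names, and the partition count are illustrative only): repartitioning both sides of a join by the join key with the same partition count puts matching keys in matching partitions, so Spark may avoid an extra shuffle for the join itself, and caching the repartitioned DataFrame amortises that one shuffle across several downstream operations.
import org.apache.spark.sql.functions.col

// Placeholder inputs.
val orders    = sqlContext.read.parquet("/data/orders")
val customers = sqlContext.read.parquet("/data/customers")

// Partition both sides by the join key with the same partition count.
val ordersByCustomer    = orders.repartition(200, col("customer_id")).cache()
val customersByCustomer = customers.repartition(200, col("customer_id"))

val joined = ordersByCustomer.join(customersByCustomer, Seq("customer_id"))
joined.explain() // inspect the physical plan for the remaining Exchange operators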
Please post your code if you have any other specific doubt.

What is the opposite of ETL?

ETL (extract, transform, load) is the process of getting data into a data warehouse from various sources.
Is there a name for the opposite process? Extracting data from a data warehouse, transforming it and putting it into a table - usually to feed a reporting tool.
Technically speaking, the opposite of an ETL is an ELT.
Instead of extract, transform, then load, an ELT is an extract, load, then transform. The choice between which of the two pipelines should be used depends on the system and the nature of the data. For example, the process of bringing data into a relational database necessarily requires a transformation before loading, but other frameworks, such as Hadoop, are better able to handle unstructured data and apply structure to it after loading takes place.
Since this question was asked, 6 years ago (!), a lot has changed in the ETL landscape.
There is a new trend called "Reverse ETL", which is the idea of taking cleaned/transformed/modeled data from your warehouse back into the SaaS applications (Salesforce, Marketo, Zendesk, HubSpot, etc.) that your teams use.
The main tools are
getCensus
Seekwell
Grouparoo
You can read more about this nascent trend here and here too
The ETL abbreviation applies to any extract, transform and load sequence. It can be applied to extracting data from a data warehouse, transforming the data and loading the transformed data into a table.
In your question you have two ETL sequences; one that loads the data into the data warehouse and one that extracts information from the data warehouse and loads this data into the table.

How does Spark read 100K images efficiently?

Currently, I'm programming something on image classification with Spark. I need to read all the images into memory as RDD and my method is as following:
val images = spark.wholeTextFiles("hdfs://imag-dir/")
imag-dir is the target directory storing the images on HDFS. With this method, all the images will be loaded into memory and every image will be organized as an "image name, image content" pair. However, I find this process time consuming; is there any better way to load a large image data set into Spark?
I suspect that may be because you have a lot of small files on HDFS, which is a problem in itself (the 'small files problem'). Here you'll find a few suggestions for addressing the issue.
You may also want to set the number of partitions (the minPartitions argument of wholeTextFiles) to a reasonable number: at least 2x the number of cores in your cluster (look there for details).
But in sum, apart from the 2 ideas above, the way you're loading those is ok and not where your problem lies (assuming spark is your Spark context).
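A small sketch of the minPartitions suggestion above, using sc for the SparkContext (the partition count of 32 assumes roughly a 16-core cluster; tune it for yours). For binary image data, SparkContext.binaryFiles is usually a better fit than wholeTextFiles, since it hands you the raw bytes instead of decoding them as text:
// Same directory as in the question; minPartitions set explicitly.
val images = sc.wholeTextFiles("hdfs://imag-dir/", minPartitions = 32)

// Alternative for binary content: (file path, PortableDataStream) pairs.
val imageBytes = sc.binaryFiles("hdfs://imag-dir/", minPartitions = 32)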

cassandra and hadoop - realtime vs batch

As per http://www.dbta.com/Articles/Columns/Notes-on-NoSQL/Cassandra-and-Hadoop---Strange-Bedfellows-or-a-Match-Made-in-Heaven-75890.aspx
Cassandra has pursued somewhat different solutions than has Hadoop. Cassandra excels at high-volume real-time transaction processing, while Hadoop excels at more batch-oriented analytical solutions.
What are the differences in the architecture/implementation of Cassandra and Hadoop which account for this sort of difference in usage (in lay software-professional terms)?
I wanted to add this because I think there might be a misleading statement here suggesting Cassandra performs well for reads.
Cassandra is not very good at random reads either. Compared to other solutions out there, it handles random reads over a huge amount of data well, but at some point, if the reads are truly random, you can't avoid hitting the disk every single time, which is expensive; throughput may come down to something as low as a few thousand hits/second, depending on your cluster. So planning on doing lots of random queries might not be the best idea; you'll run into a wall if you start thinking like that. I'd say everything in big data works better when you do sequential reads, or find a way to store the data so it can be read sequentially. In most cases, even when you do real-time processing, you still want to find a way to batch your queries.
This is why you need to think beforehand what you store under a key and try to get the most information possible out of a read.
It's also kind of funny that the statement uses "transaction" and "Cassandra" in the same sentence, because that really doesn't happen.
On the other hand, Hadoop is meant to be batch almost by definition. But Hadoop is a distributed MapReduce framework, not a DB; in fact, I've seen and used a lot of Hadoop over Cassandra. They're not antagonistic technologies.
Handling your big data in real time is doable but requires good thinking and care about when and how you hit the database.
Edit: Removed secondary indices example, as last time I checked that used random reads (though I've been away from Cassandra for more than a year now).
Vanilla Hadoop consists of a Distributed File System (DFS) at the core and libraries that support the MapReduce model for writing programs to do analysis. The DFS is what enables Hadoop to be scalable: it takes care of chunking data across multiple nodes in a multi-node cluster, so that MapReduce can work on the individual chunks of data on the available nodes, thus enabling parallelism.
The paper for Google File System which was the basis for Hadoop Distributed File System (HDFS) can be found here
The paper for Map Reduce model can be found here
For a detailed explanation on Map Reduce read this post
Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store. It is not a conventional database but is more like a Hashtable or HashMap which stores key/value pairs. Unlike HBase, Cassandra does not run on top of HDFS; it has its own distributed storage layer. HBase is an implementation of Google's BigTable, and Cassandra borrows BigTable's data model (its distribution model comes from Amazon's Dynamo). The paper for Google BigTable can be found here.
BigTable makes use of the Sorted String Table (SSTable) to store key/value pairs. An SSTable is just a file (in GFS for BigTable, on local disk for Cassandra) which stores keys followed by their values. Furthermore, BigTable maintains an index which holds, for each key, its offset in the file, which enables reading the value for that key with only a seek to the offset location. An SSTable is effectively immutable, which means that after the file is created, no modifications can be made to existing key/value pairs; new key/value pairs are appended to the file. Updates and deletes of records are also appended: an update as a newer key/value pair, and a deletion as the key with a tombstone value. Duplicate keys are therefore allowed in this file. The index is also modified whenever an update or delete takes place, so that the offset for that key points to the latest value or tombstone.
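To make the append-only/tombstone idea concrete, here is a purely illustrative toy in Scala (not Cassandra's or BigTable's actual code): every write appends an entry, a delete appends a tombstone, and an in-memory index maps each key to the offset of its latest entry.
import scala.collection.mutable

object ToySSTable {
  // Append-only log of (key, value) entries; None stands for a tombstone.
  private val log   = mutable.ArrayBuffer.empty[(String, Option[String])]
  // Index from key to the offset of the latest entry for that key.
  private val index = mutable.Map.empty[String, Int]

  private def append(key: String, value: Option[String]): Unit = {
    log += ((key, value))        // entries are never modified in place
    index(key) = log.length - 1  // the index now points at the newest entry
  }

  def put(key: String, value: String): Unit = append(key, Some(value))
  def delete(key: String): Unit             = append(key, None) // deletion = tombstone

  def get(key: String): Option[String] =
    index.get(key).flatMap(offset => log(offset)._2) // one index lookup + one "seek"
}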
Thus you can see that Cassandra's internals allow fast reads and writes, which is crucial for real-time data handling, whereas vanilla Hadoop with MapReduce can be used to process batch-oriented, passive data.
Hadoop consists of two fundamental components: a distributed datastore (HDFS) and a distributed computation framework (MapReduce). It reads a bunch of input data from the datastore and writes output back to it. It needs a distributed datastore because it performs parallel computation on data local to a cluster of machines, minimizing data loading time.
Cassandra, on the other hand, is a datastore with linear scalability and fault tolerance. It lacks the parallel computation ability provided by MapReduce in Hadoop.
The default datastore (HDFS) of Hadoop can be replaced with other storage backends, such as Cassandra, GlusterFS, Ceph, Amazon S3, Microsoft Azure's file system, MapR's FS, etc. However, each alternative has its pros and cons and should be evaluated based on your needs.
There are some resources that help you integrate Hadoop with Cassandra: http://docs.datastax.com/en/cassandra/3.0/cassandra/configuration/configHadoop.html
