What is the opposite of ETL? - reporting

ETL (extract, transform, load) is the process of getting data into a data warehouse from various sources.
Is there a name for the opposite process? Extracting data from a data warehouse, transforming it and putting it into a table - usually to feed a reporting tool.

Technically speaking, the opposite of an ETL is an ELT.
Instead of extract, transform, then load, an ELT is extract, load, then transform. Which of the two pipelines to use depends on the system and the nature of the data. For example, the process of bringing data into a relational database necessarily requires a transformation before loading, but other frameworks, such as Hadoop, are better able to handle unstructured data and apply structure to it after loading takes place.
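A minimal sketch of the difference, using pandas and an in-memory SQLite database as stand-ins for a real source system and warehouse (table and column names are made up):

    import sqlite3
    import pandas as pd

    source = pd.DataFrame({"order_id": [1, 2], "amount_cents": [1999, 4500]})
    warehouse = sqlite3.connect(":memory:")

    # ETL: transform in the pipeline, then load only the finished shape.
    transformed = source.assign(amount_usd=source["amount_cents"] / 100)
    transformed[["order_id", "amount_usd"]].to_sql("orders", warehouse, index=False)

    # ELT: load the raw extract first, then transform inside the warehouse.
    source.to_sql("orders_raw", warehouse, index=False, if_exists="replace")
    warehouse.execute("""
        CREATE TABLE orders_elt AS
        SELECT order_id, amount_cents / 100.0 AS amount_usd
        FROM orders_raw
    """)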

Since this question was asked, six years ago (!), a lot has changed in the ETL landscape.
There is a new trend called "Reverse ETL", which is the idea of taking cleaned/transformed/modeled data from your warehouse back into the SaaS applications (Salesforce, Marketo, Zendesk, HubSpot, etc.) that your teams use.
The main tools are:
getCensus
Seekwell
Grouparoo
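A hand-rolled sketch of the reverse-ETL idea, for illustration only (the endpoint, auth, and field names below are hypothetical; the tools above handle batching, rate limits, and change detection for you):

    import sqlite3
    import requests

    # Read modeled rows from the warehouse (SQLite as a stand-in here).
    warehouse = sqlite3.connect("warehouse.db")
    rows = warehouse.execute(
        "SELECT email, lifetime_value FROM customer_facts"
    ).fetchall()

    # Push each row back into a SaaS tool via its REST API.
    for email, ltv in rows:
        requests.patch(
            "https://api.example-crm.test/v1/contacts",  # hypothetical endpoint
            json={"email": email, "properties": {"lifetime_value": ltv}},
            headers={"Authorization": "Bearer <token>"},
            timeout=10,
        )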

The ETL abbreviation applies to any extract, transform and load sequence. It can be applied to extracting data from a data warehouse, transforming the data and loading the transformed data into a table.
In your question you have two ETL sequences: one that loads the data into the data warehouse, and one that extracts data from the data warehouse, transforms it, and loads it into the reporting table.

Related

What is the difference between a Big Data Warehouse and a traditional Data Warehouse

Usually, data warehouses in the context of big data are managed and implemented on the basis of Hadoop-based systems, like Apache Hive (right?).
On the other hand, my question concerns the methodological process.
How does big data affect the design process of a data warehouse?
Is the process similar, or must new tasks be considered?
Hadoop is similar in architecture to MPP data warehouses, but with some significant differences. Instead of being rigidly bound to a parallel architecture, processors are loosely coupled across a Hadoop cluster and each can work on different data sources.
The data manipulation engine, data catalog, and storage engine can work independently of each other with Hadoop serving as a collection point. Also critical is that Hadoop can easily accommodate both structured and unstructured data. This makes it an ideal environment for iterative inquiry. Instead of having to define analytics outputs according to narrow constructs defined by the schema, business users can experiment to find what queries matter to them most. Relevant data can then be extracted and loaded into a data warehouse for fast queries.
The Hadoop ecosystem starts from the same aim of wanting to collect together as much interesting data as possible from different systems, but approaches it in a radically better way. With this approach, you dump all data of interest into a big data store (usually HDFS – Hadoop Distributed File System). This is often in cloud storage – cloud storage is good for the task, because it’s cheap and flexible, and because it puts the data close to cheap cloud computing power. You can still then do ETL and create a data warehouse using tools like Hive if you want, but more importantly you also still have all of the raw data available so you can also define new questions and do complex analyses over all of the raw historical data if you wish. The Hadoop toolset allows great flexibility and power of analysis, since it does big computation by splitting a task over large numbers of cheap commodity machines, letting you perform much more powerful, speculative, and rapid analyses than is possible in a traditional warehouse.
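A rough sketch of that "load raw first, structure later" pattern, using PySpark (the HDFS paths, column names, and table name are assumptions):

    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder
             .appName("raw-to-warehouse")
             .enableHiveSupport()
             .getOrCreate())

    # Raw events were dumped into HDFS as JSON lines, with no schema imposed up front.
    raw = spark.read.json("hdfs:///data/raw/events/")

    # Apply structure only when a question comes up, then persist a queryable
    # table for the fast-query / data-warehouse side of the house.
    daily = (raw
             .withColumn("day", F.to_date("event_ts"))
             .groupBy("day", "event_type")
             .count())

    daily.write.mode("overwrite").saveAsTable("daily_event_counts")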

How does GreenPlum handle multiple large joins and simultaneous workloads?

Our product is extracts from our database; they can be as large as 300 GB+ as files. To produce them we join multiple large tables (close to 1 TB in size in some cases). We do not aggregate the data at all; these are pure extracts. How does Greenplum handle this kind of large data set? (The join keys are composite keys of 3+ columns, and not every table has the same keys to join on; the only common key is the first one, and if the data were distributed by that key there would be a lot of skew, since the data itself is not balanced.)
You should use writable external tables for these types of large data extracts, because they can leverage gpfdist and write data in parallel. It will be very fast.
https://gpdb.docs.pivotal.io/510/ref_guide/sql_commands/CREATE_EXTERNAL_TABLE.html
Also, your use case doesn't really indicate skew. Skew would be either storing the data distributed by a poor column choice like gender_code, or processing skew where you filter by a column or columns for which only a few segments have the data.
In general, Greenplum Database handles this kind of load just fine. The query is executed in parallel on the segments.
Your bottleneck is likely the final export from the database: if you use SQL (or COPY), everything has to go through the master to the client, which takes time and is slow.
As Jon pointed out, consider using an external table and writing out the data as it comes out of the query. Also avoid any kind of sort operation in your query if possible; it is wasted effort, because the data lands in the external table file unsorted anyway.
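A minimal sketch of that export path, driven from Python with psycopg2 (host, port, table, and column names are placeholders; a gpfdist process must already be running on the target host):

    import psycopg2

    conn = psycopg2.connect("dbname=analytics host=gp-master user=etl")
    with conn, conn.cursor() as cur:
        # Writable external table backed by gpfdist: each segment writes its
        # slice of the result in parallel instead of funneling through the master.
        cur.execute("""
            CREATE WRITABLE EXTERNAL TABLE sales_extract_out (
                order_key bigint,
                item_key  bigint,
                amount    numeric
            )
            LOCATION ('gpfdist://etl-host:8081/sales_extract.csv')
            FORMAT 'CSV' (DELIMITER ',')
        """)
        # Pure extract: a large join, no aggregation, no ORDER BY.
        cur.execute("""
            INSERT INTO sales_extract_out
            SELECT o.order_key, i.item_key, i.amount
            FROM orders o
            JOIN order_items i USING (order_key)
        """)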

Is it generally better to transform semi-structured into structured data on Hadoop if the possibility exists?

I have large and growing datasets of semi-structured data in JSON files on a Hadoop cluster. The data is fairly benign, but one of the keys, which holds a list of maps, can vary heavily in size: anywhere from zero up to a few thousand of those maps, each with a few dozen keys of its own.
However, the data could be transformed into two separate tables of structured data linked by foreign keys. Both would be narrow tables; one of them would be roughly ten times as long as the other.
I could either keep the data in a semi-structured format and use a wide-column store like HBase, or use a columnar format like Parquet to store the data in two large relational tables.
It is unlikely the data format will change, but it can't be ruled out.
I'm new to Hadoop and Big Data, so which of the two possibilities is generally preferable? Should semi-structured data be changed into structured data if the possibility exists and the data format is fairly constant?
EDIT: Additional info as requested by Rahul Sharma.
The data consists of shopping carts from shopping software; the variable length comes from the variable number of items in the carts. The data is initially in XML and is then transformed into JSON, but not by me; I have no control over that step.
No realtime analytics planned, only batch analytics.
The relationship between the two tables would be that one table holds the customer/purchase info while the other holds the purchased items. Both would be linked by a suitable key.
I hope this helps.
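For what it's worth, a rough sketch of what the two-table option could look like with PySpark and Parquet (the JSON field names below are made up):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("carts-to-parquet").getOrCreate()

    carts_raw = spark.read.json("hdfs:///data/raw/carts/")

    # Parent table: one row per cart / purchase.
    carts = carts_raw.select("cart_id", "customer.customer_id", "purchased_at")
    carts.write.mode("overwrite").parquet("hdfs:///data/structured/carts/")

    # Child table: one row per purchased item, foreign-keyed back by cart_id.
    items = (carts_raw
             .select("cart_id", F.explode("items").alias("item"))
             .select("cart_id", "item.sku", "item.quantity", "item.price"))
    items.write.mode("overwrite").parquet("hdfs:///data/structured/cart_items/")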

Using Hadoop & related projects to analyze usage patterns that constantly change

We're strategizing on how to analyze user "interest" (clicks, likes, etc) on 1M+ items on our site to generate a "similar items" list.
In order to process a large amount of raw data we're learning about Hadoop, Hive, and related projects.
My question is regarding this concern: Hadoop/Hive and the like seem to be geared more towards data dumps followed by processing cycles. Presumably the end of the processing cycle is something along the lines of an indexed graph of links between related items.
If I'm on track so far, how is data typically processed in these scenarios? For example:
Is the raw user data re-analyzed at intervals to re-build an indexed graph of links?
Do we stream data as it comes in, analyze it and update the data store?
As the resultant data from the analysis changes, are we typically updating it piece by piece, or re-processing in bulk?
Is this use case better addressed by Cassandra than Hive/HDFS?
I'm looking to better understand the common approach to this kind of big data processing.
I think this is a good use case for the Hadoop family of tools.
It looks to me like HDFS and Flume are obvious choices; I would look into either HBase or Hive depending on what kinds of analysis you are interested in and how flexible you need to be in organizing and querying the data.
Is the raw user data re-analyzed at intervals to re-build an indexed graph of links?
Answer: Hadoop is very good for this. I would use HBase for this, but there are other choices.
Do we stream data as it comes in, analyze it and update the data store?
Answer: Flume is good for this.
As the resultant data from the analysis changes, are we typically updating it piece by piece, or re-processing in bulk?
Answer: You have options to do both. Bulk would probably be a MapReduce job on HDFS, while piece-by-piece updates could be managed through HBase column-family values or Hive rows. If you give more details, I could be more precise.
Is this use case better addressed by Cassandra than Hive/HDFS?
Answer: Cassandra and HBase are both modeled on Google's BigTable. I think the choice depends on how you need to organize, access, analyze, and update your data. I can provide more guidance if needed.
HBase is usually better for semi-structured data and high read/write processing.
HDFS is generally a good choice for flexible, scalable storage of data dumps, as you call them.
Flume is applicable for moving streaming data.
I would also consider looking into Titan and HBase if you are thinking graph.
Hive would be applicable if you are interested in tabular-oriented data and using SQL-like queries.
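For the bulk-rebuild route in particular, here is a rough sketch using PySpark rather than a hand-written MapReduce job (the paths and the user_id/item_id column names are assumptions): count how often two items are touched by the same user, then persist the rebuilt similarity graph for a serving store (HBase, etc.) to bulk-load.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("similar-items").getOrCreate()

    # Raw interest events: one row per (user, item) click/like.
    clicks = (spark.read.parquet("hdfs:///data/raw/interest_events/")
              .select("user_id", "item_id")
              .distinct())

    # Co-occurrence: pairs of items that the same users interacted with.
    pairs = (clicks.alias("a")
             .join(clicks.alias("b"), "user_id")
             .where(F.col("a.item_id") < F.col("b.item_id"))
             .groupBy(F.col("a.item_id").alias("item_a"),
                      F.col("b.item_id").alias("item_b"))
             .count())

    # Rebuilt "similar items" edges; re-run the whole job at each interval.
    pairs.write.mode("overwrite").parquet("hdfs:///data/derived/item_pairs/")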

Assessing and comparing Hadoop for Business Intelligence Design considerations

I am considering various technologies for data warehousing and business intelligence, and have come upon this radical tool called Hadoop. Hadoop doesn't seem to be exactly built for BI purposes, but there are references to it having potential in this field (http://www.infoworld.com/d/data-explosion/hadoop-pitched-business-intelligence-488).
From the little information I have found on the internet, my gut tells me that Hadoop can become a disruptive technology in the space of traditional BI solutions. There really is sparse information on this topic, and hence I wanted to gather all the gurus' thoughts here on the potential of Hadoop as a BI tool compared to traditional back-end BI infrastructure like Oracle Exadata, Vertica, etc. For starters, I would like to ask the following question -
Design Considerations - How would designing a BI solution with Hadoop be different from traditional tools? I know it should be different, as I read one cannot create schemas in Hadoop. I also read that a major advantage will be the complete elimination of ETL tools for Hadoop (is this true?). Do we need Hadoop + Pig + Mahout to get a BI solution?
Thanks & Regards!
Edit - Breaking this down into multiple questions. Will start with the one I think is most important.
Hadoop is a great tool to be part of a BI solution. It is not, itself, a BI solution. What Hadoop does is take in Data_A and output Data_B. Whatever is needed for BI but is not in a useful form can be processed using MapReduce and output in a useful form, be it CSV, Hive, HBase, MSSQL, or anything else used to view data.
I believe Hadoop is supposed to be the ETL tool. That's what we are using it for. We process gigs of log files every hour, store them in Hive, and do daily aggregations that are loaded into an MSSQL server and viewed through a visualization layer.
The major design considerations I've run against are:
- Data Flexibility: Do you want your users to view pre-aggregated data, or to have the flexibility to adjust the query and look at the data however they want?
- Speed: How long do you want your users to wait for the data? Hive (for example) is slow. It takes minutes to generate results, even on fairly small data sets. The larger the data traversed the longer it will take to generate a result.
- Visualization: What type of visualization do you want to use? Do you want to custom-build a lot of pieces, or be able to use something off the shelf? What constraints and flexibility are needed? How flexible and changeable does the visualization need to be?
hth
Update: As a response to #Bhat's comment asking about lack of visualization...
The lack of a visualization tool that would allow us to effectively utilize the data stored in HBase was a major factor in re-evaluating our solution. We stored the raw data in Hive, and pre-aggregated the data and stored it in HBase. To utilize this we were going to have to write a custom connector (did this part) and a visualization layer. We looked at what we would be able to produce versus what is commercially available, and went the commercial route.
We still use Hadoop as our ETL tool for processing our weblogs, it's fantastic for that. We just send the ETL'd raw data to a commercial big data database that will take the place of both Hive and HBase in our design.
Hadoop doesn't really compare to MSSQL or other data warehouse storage. Hadoop doesn't do any storage itself (ignoring HDFS); it does processing of data. Running MapReduce jobs (which Hive does) is going to be slower than MSSQL (or the like).
Hadoop is very well suited for storing colossal files that can represent fact tables. These tables can be partitioned by placing the individual files that make up the table into separate directories. Hive understands such file structures and allows you to query them like partitioned tables. You can phrase your BI questions to the Hadoop data in the form of SQL queries via Hive, but you will still need to write and run an occasional MapReduce job.
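A small sketch of that partitioned-fact-table layout, using Spark with Hive support (the table name, paths, and columns are assumptions):

    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder
             .appName("fact-partitions")
             .enableHiveSupport()
             .getOrCreate())

    # One day's worth of page views, staged as a file on HDFS.
    daily = (spark.read.parquet("hdfs:///data/staging/page_views/2024-01-15/")
             .withColumn("view_date", F.lit("2024-01-15")))

    # partitionBy places each day under its own directory
    # (.../page_views/view_date=2024-01-15/), which Hive treats as one
    # partition of a single logical fact table.
    daily.write.partitionBy("view_date").mode("append").saveAsTable("page_views")

    # BI-style questions are then plain SQL against the partitioned table.
    spark.sql("SELECT view_date, count(*) AS views FROM page_views GROUP BY view_date").show()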
From a business perspective, you should consider Hadoop if you have a lot of low-value data. There are many cases where RDBMS/MPP solutions are not cost effective.
You should also consider Hadoop as a serious option if your data is not structured (HTML pages, for example).
We are creating a comparison matrix of BI tools for Big Data / Hadoop:
http://hadoopilluminated.com/hadoop_book/BI_Tools_For_Hadoop.html
It is a work in progress and we would love any input.
(Disclaimer: I am the author of this online book.)

Resources