Big Data Storage and Queries vs Traditional Relational/Non-Relational DBs - hadoop

I am a rising senior CS major at a large state university and I'm working as an intern at a large publicly traded technology company in their data science department. I've learned in school about data structures and algorithms (Maps, Trees, Graphs, Sorting Algorithms, Seaching Algorithms, MapReduce, etc.) and I have some experience through personal projects with MySQL and SQL queries.
My project for this internship is to create a dashboard for displaying analytics gathered from a Hadoop database. I'm struggling to understand how this data is structured and queried. I'm pretty sure all of the data in Hadoop is coming from the production Oracle Relational DB that runs their platform. I guess my core question is why are Hadoop and distributed processing required to gather analytics from a database that is already in a structured format? What does the data look like stored in Hadoop? Are there tables like MySQL, or JSON docs like MongoDB? I will be querying Hadoop through Druid but I'm not sure what is even in this database.
The engineers I have been working with have been great at explaining things to me, especially questions about their specific implementation, but they only have a certain amount of time to dedicate to helping an intern and I want to take the initiative to learn some of this on my own.
As a side note, it is incredible how different it is working on a school project than a project at a company with millions of active users and petabytes of sensitive information.

Hadoop is not a database, and it therefore has no such thing as tables or any inherit structure of relations or documents.
You can place a schema over stored files of various formats like CSV, JSON, Avro, Parquet, etc using Hive, Presto, SparkSQL, for example, but these are all tools that read from the Hadoop FileSystem, and not part of Hadoop itself. Tables and databases at that level are only metadata, and not entirely representative as what the raw data looks like
Hadoop is simply capable of storing more data than an Oracle database, and is free, however for fast analytics, it's recommended to compute statistics within Hadoop frameworks in distributed fashion, then load back into a indexed system (such as Druid) or just any actual database

I get your question. Basically you are trying to understand what and how are the data present in Hadoop and why not Traditional database but the data from a Traditional Database in Hadoop.
Few main points to understand when it comes to Hadoop,
1. Hadoop is not only for structured data, it can be used for semi structured and unstructured data as well. Mostly for the purpose of Analysis of data.
2. Hadoop is a framework and has different components present in it. The majorly used components to query structured data from HDFS is Hive and Impala.
3. As far as structured data are concerned, Hadoop has HDFS and Hive Metastore for storing the data in a structured way. HDFS stores only data files (such as text, avro, parquet, json, etc.,) not metadata (such as column name, count of rows, etc.,). On the other hand Hive Metastore is basically traditional database such as MySQL, Postgres, etc., and this carries only the metadata. So metastore knows where the data of a table is stored in HDFS i.e., the HDFS file path.
For more on this point - you can read one of my posts HERE
4. Why Hadoop? Hadoop is designed to store big amount of data with high availability due to its distributed nature. Also, Hadoop is meant for WRITE once and READ many times - meaning it is more for analytics and reporting purposes not for transactional purposes like how the traditional databases are being used. More importantly, its open-source!
Hope this helps you in getting a baseline!

Related

Hive - Is it a good fit for building a datawarehouse?

So like most Enterprise companies, we have built a data warehouse in Hadoop, with user queries supported in Hive, and now after a few months and user acceptance testing everyone is a little surprised about how it is not like a standard (Oracle/Netezza) database when used by end-users for ad-hoc data analysis.
While I understand that this is probably a very stupid way of doing projects (we should have researched the use cases and best fit technologies before building the product), and I know the basic technical aspects of how Hadoop differs from single node machines... I would still want to understand if using Hadoop/Hive makes sense for data warehouses in any scenario?
For instance,
Are there always trade-offs in query performance or can they be optimized with configuration changes, horizontal scaling of hardware?
Can it ever be as fast as something like Netezza - which uses non-commodity hardware but functions on a similar architecture?
Where is Hadoop great and absolutely defeats everything else in comparison?
I would argue the Hive MetaStore is useful more than HiveServer2 itself as the query interface.
The MetaStore is what Presto and Spark use to get data much quicker than MapReduce, but maybe not as fast as a well-optimized Tez query, and improvements are being made in Hive v2.x+ with LLAP, for example.
In the end, Hive is really only useful if the ingestion pipelines are actually storing the data in columnar formats of ORC or Parquet to begin with. From there, and reasonable query engine can scan through that data fairly quickly, and Hive just happens to be considered the defacto implementation of that access pattern, whereas Impala or Presto are often more used for adhoc access.
That being said, Hive (and other SQL on Hadoop) is not used for "building", it is used for "analyzing"
And I don't know what you mean by "standard" - Hive supports any ODBC/JDBC Connection, so it's not like you go to the CLI for all access, and HUE or Zeppelin make really nice notebooks for SQL analysis over Hive.
To answer your question,
Are there always trade-offs in query performance or can they be optimized with configuration changes, horizontal scaling of hardware?
If you are using only hive tool from Hadoop for Adhoc querying then that is not right choice for adhoc querying and data analysis. We have explore better option according to you use case and make tech selection from Hive LLAP, HBase, Spark, SparkSQL, Spark Streaming, Apache storm, Imapala, Apache Drill and Prestodb etc.
Can it ever be as fast as something like Netezza - which uses non-commodity hardware but functions on a similar architecture?
It is better tool now days most of organization using but you have to be specific about tech tools selection from Hadoop tech stack according to you use case and after studying it do right selection for technology.
Where is Hadoop great and absolutely defeats everything else in comparison?
Hadoop is best for implementing data lake platform in large organization where data scattered across multiple systems, and using Hadoop data lake you can have data at center place. Which can leveraged as data analytics platform for organization data which accumulated over the time period. Also can be used for data stream data processing to get results in real time.
Hope this will help.
Well, there are many benefits of using storing big data in HDFS or say Hadoop ecosystem. To name the most important ones, someone is there who can store and process huge data and the configuration is pretty straight forward.

When to use Hadoop, HBase, Hive and Pig?

What are the benefits of using either Hadoop or HBase or Hive ?
From my understanding, HBase avoids using map-reduce and has a column oriented storage on top of HDFS. Hive is a sql-like interface for Hadoop and HBase.
I would also like to know how Hive compares with Pig.
MapReduce is just a computing framework. HBase has nothing to do with it. That said, you can efficiently put or fetch data to/from HBase by writing MapReduce jobs. Alternatively you can write sequential programs using other HBase APIs, such as Java, to put or fetch the data. But we use Hadoop, HBase etc to deal with gigantic amounts of data, so that doesn't make much sense. Using normal sequential programs would be highly inefficient when your data is too huge.
Coming back to the first part of your question, Hadoop is basically 2 things: a Distributed FileSystem (HDFS) + a Computation or Processing framework (MapReduce). Like all other FS, HDFS also provides us storage, but in a fault tolerant manner with high throughput and lower risk of data loss (because of the replication). But, being a FS, HDFS lacks random read and write access. This is where HBase comes into picture. It's a distributed, scalable, big data store, modelled after Google's BigTable. It stores data as key/value pairs.
Coming to Hive. It provides us data warehousing facilities on top of an existing Hadoop cluster. Along with that it provides an SQL like interface which makes your work easier, in case you are coming from an SQL background. You can create tables in Hive and store data there. Along with that you can even map your existing HBase tables to Hive and operate on them.
While Pig is basically a dataflow language that allows us to process enormous amounts of data very easily and quickly. Pig basically has 2 parts: the Pig Interpreter and the language, PigLatin. You write Pig script in PigLatin and using Pig interpreter process them. Pig makes our life a lot easier, otherwise writing MapReduce is always not easy. In fact in some cases it can really become a pain.
I had written an article on a short comparison of different tools of the Hadoop ecosystem some time ago. It's not an in depth comparison, but a short intro to each of these tools which can help you to get started.
(Just to add on to my answer. No self promotion intended)
Both Hive and Pig queries get converted into MapReduce jobs under the hood.
HTH
I implemented a Hive Data platform recently in my firm and can speak to it in first person since I was a one man team.
Objective
To have the daily web log files collected from 350+ servers daily queryable thru some SQL like language
To replace daily aggregation data generated thru MySQL with Hive
Build Custom reports thru queries in Hive
Architecture Options
I benchmarked the following options:
Hive+HDFS
Hive+HBase - queries were too slow so I dumped this option
Design
Daily log Files were transported to HDFS
MR jobs parsed these log files and output files in HDFS
Create Hive tables with partitions and locations pointing to HDFS locations
Create Hive query scripts (call it HQL if you like as diff from SQL) that in turn ran MR jobs in the background and generated aggregation data
Put all these steps into an Oozie workflow - scheduled with Daily Oozie Coordinator
Summary
HBase is like a Map. If you know the key, you can instantly get the value. But if you want to know how many integer keys in Hbase are between 1000000 and 2000000 that is not suitable for Hbase alone.
If you have data that needs to be aggregated, rolled up, analyzed across rows then consider Hive.
Hopefully this helps.
Hive actually rocks ...I know, I have lived it for 12 months now... So does HBase...
Hadoop is a a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
There are four main modules in Hadoop.
Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
Before going further, Let's note that we have three different types of data.
Structured: Structured data has strong schema and schema will be checked during write & read operation. e.g. Data in RDBMS systems like Oracle, MySQL Server etc.
Unstructured: Data does not have any structure and it can be any form - Web server logs, E-Mail, Images etc.
Semi-structured: Data is not strictly structured but have some structure. e.g. XML files.
Depending on type of data to be processed, we have to choose right technology.
Some more projects, which are part of Hadoop:
HBase™: A scalable, distributed database that supports structured data storage for large tables.
Hive™: A data warehouse infrastructure that provides data summarization and ad-hoc querying.
Pig™: A high-level data-flow language and execution framework for parallel computation.
Hive Vs PIG comparison can be found at this article and my other post at this SE question.
HBASE won't replace Map Reduce. HBase is scalable distributed database & Map Reduce is programming model for distributed processing of data. Map Reduce may act on data in HBASE in processing.
You can use HIVE/HBASE for structured/semi-structured data and process it with Hadoop Map Reduce
You can use SQOOP to import structured data from traditional RDBMS database Oracle, SQL Server etc and process it with Hadoop Map Reduce
You can use FLUME for processing Un-structured data and process with Hadoop Map Reduce
Have a look at: Hadoop Use Cases.
Hive should be used for analytical querying of data collected over a period of time. e.g Calculate trends, summarize website logs but it can't be used for real time queries.
HBase fits for real-time querying of Big Data. Facebook use it for messaging and real-time analytics.
PIG can be used to construct dataflows, run a scheduled jobs, crunch big volumes of data, aggregate/summarize it and store into relation database systems. Good for ad-hoc analysis.
Hive can be used for ad-hoc data analysis but it can't support all un-structured data formats unlike PIG.
Consider that you work with RDBMS and have to select what to use - full table scans, or index access - but only one of them.
If you select full table scan - use hive. If index access - HBase.
Understanding in depth
Hadoop
Hadoop is an open source project of the Apache foundation. It is a framework written in Java, originally developed by Doug Cutting in 2005. It was created to support distribution for Nutch, the text search engine. Hadoop uses Google's Map Reduce and Google File System Technologies as its foundation.
Features of Hadoop
It is optimized to handle massive quantities of structured, semi-structured and unstructured data using commodity hardware.
It has shared nothing architecture.
It replicates its data into multiple computers so that if one goes down, the data can still be processed from another machine that stores its replica.
Hadoop is for high throughput rather than low latency. It is a batch operation handling massive quantities of data; therefore the response time is not immediate.
It complements Online Transaction Processing and Online Analytical Processing. However, it is not a replacement for a RDBMS.
It is not good when work cannot be parallelized or when there are dependencies within the data.
It is not good for processing small files. It works best with huge data files and data sets.
Versions of Hadoop
There are two versions of Hadoop available :
Hadoop 1.0
Hadoop 2.0
Hadoop 1.0
It has two main parts :
1. Data Storage Framework
It is a general-purpose file system called Hadoop Distributed File System (HDFS).
HDFS is schema-less
It simply stores data files and these data files can be in just about any format.
The idea is to store files as close to their original form as possible.
This in turn provides the business units and the organization the much needed flexibility and agility without being overly worried by what it can implement.
2. Data Processing Framework
This is a simple functional programming model initially popularized by Google as MapReduce.
It essentially uses two functions: MAP and REDUCE to process data.
The "Mappers" take in a set of key-value pairs and generate intermediate data (which is another list of key-value pairs).
The "Reducers" then act on this input to produce the output data.
The two functions seemingly work in isolation with one another, thus enabling the processing to be highly distributed in highly parallel, fault-tolerance and scalable way.
Limitations of Hadoop 1.0
The first limitation was the requirement of MapReduce programming expertise.
It supported only batch processing which although is suitable for tasks such as log analysis, large scale data mining projects but pretty much unsuitable for other kinds of projects.
One major limitation was that Hadoop 1.0 was tightly computationally coupled with MapReduce, which meant that the established data management vendors where left with two opinions:
Either rewrite their functionality in MapReduce so that it could be
executed in Hadoop or
Extract data from HDFS or process it outside of Hadoop.
None of the options were viable as it led to process inefficiencies caused by data being moved in and out of the Hadoop cluster.
Hadoop 2.0
In Hadoop 2.0, HDFS continues to be data storage framework.
However, a new and seperate resource management framework called Yet Another Resource Negotiater (YARN) has been added.
Any application capable of dividing itself into parallel tasks is supported by YARN.
YARN coordinates the allocation of subtasks of the submitted application, thereby further enhancing the flexibility, scalability and efficiency of applications.
It works by having an Application Master in place of Job Tracker, running applications on resources governed by new Node Manager.
ApplicationMaster is able to run any application and not just MapReduce.
This means it does not only support batch processing but also real-time processing. MapReduce is no longer the only data processing option.
Advantages of Hadoop
It stores data in its native from. There is no structure imposed while keying in data or storing data. HDFS is schema less. It is only later when the data needs to be processed that the structure is imposed on the raw data.
It is scalable. Hadoop can store and distribute very large datasets across hundreds of inexpensive servers that operate in parallel.
It is resilient to failure. Hadoop is fault tolerance. It practices replication of data diligently which means whenever data is sent to any node, the same data also gets replicated to other nodes in the cluster, thereby ensuring that in event of node failure,there will always be another copy of data available for use.
It is flexible. One of the key advantages of Hadoop is that it can work with any kind of data: structured, unstructured or semi-structured. Also, the processing is extremely fast in Hadoop owing to the "move code to data" paradigm.
Hadoop Ecosystem
Following are the components of Hadoop ecosystem:
HDFS: Hadoop Distributed File System. It simply stores data files as close to the original form as possible.
HBase: It is Hadoop's database and compares well with an RDBMS. It supports structured data storage for large tables.
Hive: It enables analysis of large datasets using a language very similar to standard ANSI SQL, which implies that anyone familier with SQL should be able to access data on a Hadoop cluster.
Pig: It is an easy to understand data flow language. It helps with analysis of large datasets which is quite the order with Hadoop. Pig scripts are automatically converted to MapReduce jobs by the Pig interpreter.
ZooKeeper: It is a coordination service for distributed applications.
Oozie: It is a workflow schedular system to manage Apache Hadoop jobs.
Mahout: It is a scalable machine learning and data mining library.
Chukwa: It is data collection system for managing large distributed system.
Sqoop: It is used to transfer bulk data between Hadoop and structured data stores such as relational databases.
Ambari: It is a web based tool for provisioning, managing and monitoring Hadoop clusters.
Hive
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data and makes querying and analyzing easy.
Hive is not
A relational database
A design for Online Transaction Processing (OLTP).
A language for real-time queries and row-level updates.
Features of Hive
It stores schema in database and processed data into HDFS.
It is designed for OLAP.
It provides SQL type language for querying called HiveQL or HQL.
It is familier, fast, scalable and extensible.
Hive Architecture
The following components are contained in Hive Architecture:
User Interface: Hive is a data warehouse infrastructure that can create interaction between user and HDFS. The User Interfaces that Hive supports are Hive Web UI, Hive Command line and Hive HD Insight(In Windows Server).
MetaStore: Hive chooses respective database servers to store the schema or Metadata of tables, databases, columns in a table, their data types and HDFS mapping.
HiveQL Process Engine: HiveQL is similar to SQL for querying on schema info on the Metastore. It is one of the replacements of traditional approach for MapReduce program. Instead of writing MapReduce in Java, we can write a query for MapReduce and process it.
Exceution Engine: The conjunction part of HiveQL process engine and MapReduce is the Hive Execution Engine. Execution engine processes the query and generates results as same as MapReduce results. It uses the flavor of MapReduce.
HDFS or HBase: Hadoop Distributed File System or HBase are the data storage techniques to store data into file system.
For a Comparison Between Hadoop Vs Cassandra/HBase read this post.
Basically HBase enables really fast read and writes with scalability. How fast and scalable? Facebook uses it to manage its user statuses, photos, chat messages etc. HBase is so fast sometimes stacks have been developed by Facebook to use HBase as the data store for Hive itself.
Where As Hive is more like a Data Warehousing solution. You can use a syntax similar to SQL to query Hive contents which results in a Map Reduce job. Not ideal for fast, transactional systems.
I worked on Lambda architecture processing Real time and Batch loads.
Real time processing is needed where fast decisions need to be taken in case of Fire alarm send by sensor or fraud detection in case of banking transactions.
Batch processing is needed to summarize data which can be feed into BI systems.
we used Hadoop ecosystem technologies for above applications.
Real Time Processing
Apache Storm: Stream Data processing, Rule application
HBase: Datastore for serving Realtime dashboard
Batch Processing
Hadoop: Crunching huge chunk of data. 360 degrees overview or adding context to events. Interfaces or frameworks like Pig, MR, Spark, Hive, Shark help in computing. This layer needs scheduler for which Oozie is good option.
Event Handling layer
Apache Kafka was first layer to consume high velocity events from sensor.
Kafka serves both Real Time and Batch analytics data flow through Linkedin connectors.
First of all we should get clear that Hadoop was created as a faster alternative to RDBMS. To process large amount of data at a very fast rate which earlier took a lot of time in RDBMS.
Now one should know the two terms :
Structured Data : This is the data that we used in traditional RDBMS and is divided into well defined structures.
Unstructured Data : This is important to understand, about 80% of the world data is unstructured or semi structured. These are the data which are on its raw form and cannot be processed using RDMS. Example : facebook, twitter data. (http://www.dummies.com/how-to/content/unstructured-data-in-a-big-data-environment.html).
So, large amount of data was being generated in the last few years and the data was mostly unstructured, that gave birth to HADOOP. It was mainly used for very large amount of data that takes unfeasible amount of time using RDBMS. It had many drawbacks, that it could not be used for comparatively small data in real time but they have managed to remove its drawbacks in the newer version.
Before going further I would like to tell that a new Big Data tool is created when they see a fault on the previous tools. So, whichever tool you will see that is created has been done to overcome the problem of the previous tools.
Hadoop can be simply said as two things : Mapreduce and HDFS. Mapreduce is where the processing takes place and HDFS is the DataBase where data is stored. This structure followed WORM principal i.e. write once read multiple times. So, once we have stored data in HDFS, we cannot make changes. This led to the creation of HBASE, a NOSQL product where we can make changes in the data also after writing it once.
But with time we saw that Hadoop had many faults and for that we created different environment over the Hadoop structure. PIG and HIVE are two popular examples.
HIVE was created for people with SQL background. The queries written is similar to SQL named as HIVEQL. HIVE was developed to process completely structured data. It is not used for ustructured data.
PIG on the other hand has its own query language i.e. PIG LATIN. It can be used for both structured as well as unstructured data.
Moving to the difference as when to use HIVE and when to use PIG, I don't think anyone other than the architect of PIG could say. Follow the link :
https://developer.yahoo.com/blogs/hadoop/comparing-pig-latin-sql-constructing-data-processing-pipelines-444.html
Let me try to answer in few words.
Hadoop is an eco-system which comprises of all other tools. So, you can't compare Hadoop but you can compare MapReduce.
Here are my few cents:
Hive: If your need is very SQLish meaning your problem statement can be catered by SQL, then the easiest thing to do would be to use Hive. The other case, when you would use hive is when you want a server to have certain structure of data.
Pig: If you are comfortable with Pig Latin and you need is more of the data pipelines. Also, your data lacks structure. In those cases, you could use Pig. Honestly there is not much difference between Hive & Pig with respect to the use cases.
MapReduce: If your problem can not be solved by using SQL straight, you first should try to create UDF for Hive & Pig and then if the UDF is not solving the problem then getting it done via MapReduce makes sense.
Pig: it is better to handle files and cleaning data
example: removing null values,string handling,unnecessary values
Hive: for querying on cleaned data
1.We are using Hadoop for storing Large data (i.e.structure,Unstructure and Semistructure data ) in the form file format like txt,csv.
2.If We want columnar Updations in our data then we are using Hbase tool
3.In case of Hive , we are storing Big data which is in structured format
and in addition to that we are providing Analysis on that data.
4.Pig is tool which is using Pig latin language to analyze data which is in any format(structure,semistructure and unstructure).
Cleansing Data in Pig is very easy,a suitable approach would be cleansing data through pig and then processing data through hive and later uploading it to hdfs.
Use of Hive, Hbase and Pig w.r.t. my real time experience in different projects.
Hive is used mostly for:
Analytics purpose where you need to do analysis on history data
Generating business reports based on certain columns
Efficiently managing the data together with metadata information
Joining tables on certain columns which are frequently used by using bucketing concept
Efficient Storing and querying using partitioning concept
Not useful for transaction/row level operations like update, delete, etc.
Pig is mostly used for:
Frequent data analysis on huge data
Generating aggregated values/counts on huge data
Generating enterprise level key performance indicators very frequently
Hbase is mostly used:
For real time processing of data
For efficiently managing Complex and nested schema
For real time querying and faster result
For easy Scalability with columns
Useful for transaction/row level operations like update, delete, etc.
Short answer to this question is -
Hadoop - Is Framework which facilitates distributed file system and programming model which allow us to store humongous sized data and process data in distributed fashion very efficiently and with very less processing time compare to traditional approaches.
(HDFS - Hadoop Distributed File system)
(Map Reduce - Programming Model for distributed processing)
Hive - Is query language which allows to read/write data from Hadoop distributed file system in a very popular SQL like fashion. This made life easier for many non-programming background people as they don't have to write Map-Reduce program anymore except for very complex scenarios where Hive is not supported.
Hbase - Is Columnar NoSQL Database. Underlying storage layer for Hbase is again HDFS. Most important use case for this database is to be able to store billion's of rows with million's of columns. Low latency feature of Hbase helps faster and random access of record over distributed data, is very important feature to make it useful for complex projects like Recommender Engines. Also it's record level versioning capability allow user to store transactional data very efficiently (this solves the problem of updating records we have with HDFS and Hive)
Hope this is helpful to quickly understand the above 3 features.
I believe this thread hasn't done in particular justice to HBase and Pig in particular. While I believe Hadoop is the choice of the distributed, resilient file-system for big-data lake implementations, the choice between HBase and Hive is in particular well-segregated.
As in, a lot of use-cases have a particular requirement of SQL like or No-SQL like interfaces. With Phoenix on top of HBase, though SQL like capabilities is certainly achievable, however, the performance, third-party integrations, dashboard update are a kind of painful experiences. However, it's an excellent choice for databases requiring horizontal scaling.
Pig is in particular excellent for non-recursive batch like computations or ETL pipelining (somewhere, where it outperforms Spark by a comfortable distance). Also, it's high-level dataflow implementations is an excellent choice for batch querying and scripting. The choice between Pig and Hive is also pivoted on the need of the client or server-side scripting, required file formats, etc. Pig supports Avro file format which is not true in the case of Hive. The choice for 'procedural dataflow language' vs 'declarative data flow language' is also a strong argument for the choice between pig and hive.
Hadoop:
HDFS stands for Hadoop Distributed File System which uses Computational processing model Map-Reduce.
HBase:
HBase is Key-Value storage, good for reading and writing in near real time.
Hive:
Hive is used for data extraction from the HDFS using SQL-like syntax. Hive use HQL language.
Pig:
Pig is a data flow language for creating ETL. It's an scripting language.
Pig is mostly dead after Cloudera got rid of it in CDP. Also last release on Apache was 19 June, 2017: release 0.17.0 so basically no committers actively working anymore. Use Spark or Python way more powerful than Pig.

Why do Column oriented databases such as Vertica/InfoBright/GreenPlum make a fuss of Hadoop?

What is the point in feeding an Hadoop cluster and using that cluster to feed data into a Vertica/InfoBright datawarehouse ?
All thse vendor keep saying "we can connect with Hadoop", but I don't understand what's the point. What is the interest of storing in Hadoop and transfering into InfoBright ? Why not have the applications store directly in the Infobright/Vertica DW ?
Thank you !
Why combine the solutions? Hadoop has some great capabilities (see url below). These capabilities though do not include allowing business users to run quick analytics. Queries that take 30 minutes to hours in Hadoop are being delivered in 10’s of seconds with Infobright.
BTW, your initial question did not presuppose an MPP architecture and for good reason. Infobright customers Liverail, AdSafe Media & InMobi, among others, utilize IEE with Hadoop.
If you register for an Industry White Paper http://support.infobright.com/Support/Resource-Library/Whitepapers/ you will see a view of the current marketplace where four suggested Use Cases for Hadoop are outlined. It was authored by Wayne Eckerson , Director of Research, Business Applications and Architecture Group, TechTarget, in September 2011.
1) Create an online archive.
With Hadoop, organizations don’t have to delete or ship the data to offline storage; they can keep it online indefinitely by adding commodity servers to meet storage and processing requirements. Hadoop becomes a low-cost alternative for meeting online archival requirements.
2) Feed the data warehouse.
Organizations can also use Hadoop to parse, integrate and aggregate large volumes of Web or other types of data and then ship it to the data warehouse, where both casual and power users can query and analyze the data using familiar BI tools. Here, Hadoop becomes an ETL tool for processing large volumes of Web data before it lands in the corporate data warehouse.
3) Support analytics.
The big data crowd (i.e., Internet developers) views Hadoop primarily as an analytical engine for running analytical computations against large volumes of data. To query Hadoop, analysts currently need to write programs in Java or other languages and understand MapReduce, a framework for writing distributed (or parallel) applications. The advantage here is that analysts aren’t restricted by SQL when formulating queries. SQL does not support many types of analytics, especially those that involve inter-row calculations, which are common in Web traffic analysis. The disadvantage is that Hadoop is batch-oriented and not conducive to iterative querying.
4) Run reports.
Hadoop’s batch-orientation, however, makes it suitable for executing regularly scheduled reports. Rather than running reports against summary data, organizations can now run them against raw data, guaranteeing the most accurate results.
There are several reasons you may want to do that
1. Cost per TB. The storage costs in Hadoop are much cheaper than Vertica/Netezza/greenplum and the like). You can get long-term retention in Hadoop and shorter term data in the analytics DB
2. Data ingestion capabilities in hadoop (performing transformations) is better in Hadoop
3. programatic analytics (libraries like Mahout ) so you can build advanced text analytics
4. dealing with unstructured data
The MPP dbs provide better performance in ad-hoc queries, better dealing with structured data and connectivity to traditional BI tools (OLAP and reporting) - so basically Hadoop complements the offering of these DBs
Hadoop is more of a platform than a DB.
Think of Hadoop as a neat filesystem that supports lots of queries over different of file types. With this in mind, most people dump raw data onto Hadoop and use it as a staging layer in the data pipeline, where it can chew the data and push it to other systems like vertica or any other. You have several advantages that can be resumed to decoupling.
So Hadoop is turning into the facto storage platform for big data. It is simple, fault-tolerant, scales well, and it is easy to feed and to get data out of it. So most vendors are trying to push a product to companies that probably have a Hadoop installation.
What makes the joint deployment so effective for this software ?
First, both platforms have a lot in common:
Purpose-built from scratch for Big Data transformation and analytics
Leverage MPP architecture to scale out with commodity hardware,
capable of managing TBs through PBs of data
Native HA support with low administration overhead
Hadoop is ideal for the initial exploratory data analysis, where the data is often available in HDFS and is schema-less, and batch jobs usually suffice, whereas Vertica is ideal for stylized, interactive analysis, where a known analytic method needs to be applied repeatedly to incoming batches of data.
By using Vertica’s Hadoop connector, users can easily move data between the two platforms. Also, a single analytic job can be decomposed into bits and pieces that leverage the execution power of both platforms; for instance, in a web analytics use case, the JSON data generated by web servers is initially dumped into HDFS. A map-reduce job is then invoked to convert such semi-structured data into relational tuples, with the results being loaded into Vertica for optimized storage and retrieval by subsequent analytic queries.
What are the Key differences that make Hadoop and Vertica complement each other when addressing Big Data.
Interface and extensibility
Hadoop
Hadoop’s map-reduce programming interface is designed for developers.The platform is acclaimed for its multi-language support as well as ready-made analytic library packages supplied by a strong community.
Vertica
Vertica’s interface complies with BI industry standards (SQL, ODBC, JDBC etc). This enables both technologists and business analysts to leverage Vertica in their analytic use cases. The SDK is an alternative to the map-reduce paradigm, and often delivers higher performance.
Tool chain/Eco system
Hadoop
Hadoop and HDFS integrate well with many other open source tools. Its integration with existing BI tools is emerging.
Vertica
Vertica integrates with the BI tools because of its standards compliant interface. Through Vertica’s Hadoop connector, data can be exchanged in parallel between Hadoop and Vertica.
Storage management
Hadoop
Hadoop replicates data 3 times by default for HA. It segments data across the machine cluster for loading balancing, but the data segmentation scheme is opaque to the end users and cannot be tweaked to optimize for the analytic jobs.
Vertica
Vertica’s columnar compression often achieves 10:1 in its compression ratio. A typical Vertica deployment replicates data once for HA, and both data replicas can attain different physical layout in order to optimize for a wider range of queries. Finally, Vertica segments data not only for load balancing, but for compression and query workload optimization as well.
Runtime optimization
Hadoop
Because the HDFS storage management does not sort or segment data in ways that optimize for an analytic job, at job runtime the input data often needs to be resegmented across the cluster and/or sorted, incurring a large amount of network and disk I/O.
Vertica
The data layout is often optimized for the target query workload during data loading, so that a minimal amount of I/O is incurred at query runtime. As a result, Vertica is designed for real-time analytics as opposed to batch oriented data processing.
Auto tuning
Hadoop
The map-reduce programs use procedural languages (Java, python, etc), which provide the developers fine-grained control of the analytic logic, but also requires that the developers optimize the jobs carefully in their programs.
Vertica
The Vertica Database Designer provides automatic performance tuning given an input workload. Queries are specified in the declarative SQL language, and are automatically optimized by the Vertica columnar optimizer.
I'm not a Hadoop user (just a Vertica user/DBA), but I would assume the answer would be something along these lines:
-You already have a setup using Hadoop and you want to add a "Big Data" database for intensive analytical analysis.
-You want to use Hadoop for non-analytical functions and processing and a database for analysis. But it is the same data, so no need for two feeds.
To expand slightly on Arnon's answer, Hadoop has been recognized as a force that is not going away and is gaining increasing traction in organizations, many times via grassroots efforts from developers. MPP databases are good at answering questions that we know about at design time such as "How many transactions do we get per hour by country?".
Hadoop started as a platform for a new type of developer that lives somewhere between analysts and developers, one who can write code but also understands data analysis and machine learning. MPP databases (column or not) are very poor at serving this type of developer who often is analyzing unstructured data, using algorithms that require too much CPU power to run in a database or datasets which are too large. The sheer amount of CPU power required to build some models makes running these algorithms in any sort of traditional sharded DB impossible.
My personal pipeline using hadoop typically looks like:
Run a number of very large global queries in Hadoop to get a basic feel for the data and the distribution of variables.
Use Hadoop to build a smaller dataset with just the data I am interested in.
Export the smaller dataset into a relational DB.
Run lots of small queries on the relational db, build excel sheets, sometimes do a little R.
Bear in mind that this workflow only works for the "analyst developer" or "data scientist". Others mileage will vary.
Coming back to your question due to people like me abandoning their tools these companies are looking for ways to remain relevant in an age where Hadoop is synonymous with big data, the coolest startups and cutting edge technology (whether this is earned or not you may discuss amongst yourselves.) Also many Hadoop installations are an order of magnitude or more larger than an organizations MPP deployments, meaning more data is being retained for longer in Hadoop.
Massive parallel database like Greenplum DB are excellent for handling massive amounts of structured data. Hadoop is excellent at handling even more massive amounts of unstructured data, e.g. websites.
Nowadays, a ton of interesting analytics combines these both types of data to gain insight. Therefore it is important for these database systems to be able to integrate with Hadoop.
For example you could do text processing on the Hadoop Cluster using MapReduce until you have some scoring value per product or something. This scoring value then could be used by the database to combine it with other data that is already stored in the database or data that has been loaded into the database from other sources.
Unstructured data, by their nature, is not suitable for loading into your traditional data warehouse. Hadoop mapreduce jobs can extract structures out of your log files (ex) and then the same can then be ported into your DW for analytics. Hadoop is batch processing, therefore is not suitable for analytic query processing. So you can process your data using hadoop to bring some structure, and then make it query ready via your visualization/sql layer.
What is the point in feeding an Hadoop cluster and using that cluster to feed data into a Vertica/InfoBright datawarehouse ?
The point is you would not want your users to fire up a query and wait for minutes, sometimes hours before you come back with an answer. Hadoop cannot provide you with a real time query response. Although this is changing with the advent of Cloudera's Impala and Hortonworks's Stinger. These are real-time data processing engines over Hadoop.
Hadoop's underlying data system, HDFS, allows chunking up your data and distributing it over the nodes in your cluster. In fact, HDFS can also be replaced with a 3rd party data storage like S3. Point is: Hadoop provides both -> storage + processing. So you are welcome to use hadoop as storage engine and extract the data into your data warehouse when needed. You can also use Hadoop to create cubes and marts and store these marts in the warehouse.
However, with the advent of Stinger and Impala, the strength of these claims will eventually be erased. So keep an eye out.

Is Apache Hive used more for the programming language or for the data warehouse aspects?

I used to think that Hive was just a SQL-like programming language used to make writing MapReduce-type jobs easier (i.e., a SQL-like version of Pig/Pig Latin). I'm reading more about it now, though, and apparently it's actually a full data warehouse infrastructure.
Is one of these use cases more common? That is, is it primarily used for the data warehouse infrastructure it provides, or more for the SQL-like interface? Or are both aspects of equal utility and importance?
(I'm asking because I'm trying to figure out what parts of Hive I should focus on learning about.)
That's exactly what I used to think too. Now that I've had about a month's experience with Hive, I now find that it's a great ETL tool... for a data warehouse later down the line.
Hive doesn't compare with MDX. Hive is very row-based and doesn't allow a lot of the messier operations that SQL or MDX (Multidimensional Expression Language, common in BI tools) are masters at.
We're using Hive as an ETL tool to integrate our different flat file data sources and reduce the amount of data we have to upload to a SQL-based data warehouse.
If that data only has a half-life spanning a couple of weeks, then we can keep the size of our database relatively manageable, always able to reproduce the reports later on from Hive.
Hive doesn't support updates. In our implementation we used straight MapReduce jobs for populating data warehouse and Hive for making exports for further processing or importing into relational data warehouses. We also used it as an intermediary for a BI reporting tool.

Assessing and comparing Hadoop for Business Intelligence Design considerations

I am considering various technologies for data warehousing and business intelligence, and have come upon this radical tool called Hadoop. Hadoop doesn't seem to be exactly built for BI purposes, but there are references of it having potential in this field. ( http://www.infoworld.com/d/data-explosion/hadoop-pitched-business-intelligence-488).
However little information I have got from the internet, my gut tells me that hadoop can become a disruptive technology in the space of traditional BI solutions. There really is sparse information regarding this topic, and hence I wanted to gather all the Guru's thoughts here on the potential of Hadoop as a BI tool as compared to traditional backend BI infrastructure like Oracle Exadata, vertica etc. For starters, I would like to ask the following question -
Design Considerations - How would designing a BI solution with Hadoop be different from traditional tools? I know it should be different, as I read one cannot create schemas in Hadoop. I also read that a major advantage will be the complete elimination of ETL tools for Hadoop (is this true?) Do we need Hadoop + pig + mahout to get a BI solution??
Thanks & Regards!
Edit - Breaking down into multiple questions. Will start with the one i think most imp.
Hadoop is a great tool to be part of a BI solution. It is not, itself, a BI solution. What Hadoop does is takes in Data_A and outputs Data_B. Whatever is needed for Bi but is not in a useful form can be processed using MapReduce and output a useful form of the data. Be it CSV, HIVE, HBase, MSSQL or anything else used to view data.
I believe Hadoop is supposed to be the ETL tool. That's what we are using it for. We process gigs of log files every hour and store it in Hive and do daily aggregations that are loading into a MSSQL server and viewed through a visualization layer.
The major design considerations I've run against are:
- Data Flexibility: Do you want your users to view pre-aggregated data or have the flexibility to adjust the query and look at the data how they want
- Speed: How long do you want your users to wait for the data? Hive (for example) is slow. It takes minutes to generate results, even on fairly small data sets. The larger the data traversed the longer it will take to generate a result.
- Visualization: What type of visualization do you want to use? Do you want to custom build a lot of pieces or be able to use something off the shelf? What restraints and flexibility are needed for your visualization? How flexible and changeable does the visualization need to be?
hth
Update: As a response to #Bhat's comment asking about lack of visualization...
The lack of a visualization tool that would allow us to effectively utilize the data stored in HBase was a major factor in re-evaluating our solution. We stored the raw data in Hive, and pre-aggregated the data and stored it HBase. To utilize this we were going to have to write a custom connector (did this part) and visualization layer. We looked at what we would be able to produce and what is commercially available, and went the commercial route.
We still use Hadoop as our ETL tool for processing our weblogs, it's fantastic for that. We just send the ETL'd raw data to a commercial big data database that will take the place of both Hive and HBase in our design.
Hadoop doesn't really compare to MSSQL or other data warehouse storage. Hadoop doesn't do any storage (ignoring the HDFS), it does processing of data. Running MapReduces (which Hive does) is going to be slower than MSSQL (or such).
Hadoop is very well suited for storing colossal files that can represent fact tables. These tables can be partitioned by placing individual files representing the table into separate directories. Hive understands such file structures and allows to query them like partitioned tables. You can phrase your BI questions to the Hadoop data in the form of SQL queries via Hive, but you will still need to write and run an occasional MapReduce job.
From business perspective, you should consider Hadoop if you have a lot of low-value data. There are many cases when RDBMS / MPP solutions are not cost effective.
You also should consider Hadoop as a serious option if your data is not structured (HTMLs for example).
We are creating a comparison matrix for BI tools for Big Data / Hadoop
http://hadoopilluminated.com/hadoop_book/BI_Tools_For_Hadoop.html
It is work in progress and would love any input.
(disclaimer : I am the author of this online book)

Resources