I am considering various technologies for data warehousing and business intelligence, and have come upon this radical tool called Hadoop. Hadoop doesn't seem to be built exactly for BI purposes, but there are references to its potential in this field (http://www.infoworld.com/d/data-explosion/hadoop-pitched-business-intelligence-488).
Little as I have found on the internet, my gut tells me that Hadoop can become a disruptive technology in the space of traditional BI solutions. Information on this topic is sparse, so I wanted to gather all the gurus' thoughts here on the potential of Hadoop as a BI tool compared to traditional backend BI infrastructure like Oracle Exadata, Vertica, etc. For starters, I would like to ask the following question:
Design considerations: How would designing a BI solution with Hadoop differ from using traditional tools? I know it should be different, as I read that one cannot create schemas in Hadoop. I also read that a major advantage will be the complete elimination of ETL tools for Hadoop (is this true?). Do we need Hadoop + Pig + Mahout to get a BI solution?
Thanks & Regards!
Edit: Breaking this down into multiple questions. I will start with the one I think is most important.
Hadoop is a great tool to be part of a BI solution. It is not, itself, a BI solution. What Hadoop does is take in Data_A and output Data_B. Whatever is needed for BI but is not in a useful form can be processed using MapReduce and output in a useful form, be it CSV, Hive, HBase, MSSQL or anything else used to view data.
I believe Hadoop is supposed to be the ETL tool. That's what we are using it for. We process gigabytes of log files every hour, store them in Hive, and run daily aggregations that are loaded into an MSSQL server and viewed through a visualization layer.
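To make the "raw logs in, daily aggregates out" idea concrete, here is a minimal Python sketch of the map and reduce steps on a single machine. The log format (an ISO timestamp followed by a page path) and the function names are assumptions for illustration, not our actual pipeline; a real job would run the same logic distributed over the cluster.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Hypothetical log format: "<ISO timestamp> <page path>", one hit per line.
def map_line(line):
    """Map step: turn one raw log line into a ((day, page), 1) pair."""
    timestamp, page = line.split()
    day = datetime.fromisoformat(timestamp).date().isoformat()
    return (day, page), 1

def aggregate_daily(lines):
    """Reduce step: sum the counts for every (day, page) key."""
    totals = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        key, count = map_line(line)
        totals[key] += count
    return totals

def to_csv(totals):
    """Render the aggregates as CSV, ready for a bulk load into a SQL database."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["day", "page", "hits"])
    for (day, page), hits in sorted(totals.items()):
        writer.writerow([day, page, hits])
    return out.getvalue()

sample = [
    "2013-05-01T10:15:00 /home",
    "2013-05-01T11:02:00 /home",
    "2013-05-01T11:30:00 /pricing",
    "2013-05-02T09:00:00 /home",
]
print(to_csv(aggregate_daily(sample)))
```

The CSV output stands in for whatever bulk-load format the downstream SQL server expects.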
The major design considerations I've run up against are:
- Data Flexibility: Do you want your users to view pre-aggregated data, or have the flexibility to adjust the query and look at the data how they want?
- Speed: How long do you want your users to wait for the data? Hive (for example) is slow. It takes minutes to generate results, even on fairly small data sets. The larger the data traversed the longer it will take to generate a result.
- Visualization: What type of visualization do you want to use? Do you want to custom build a lot of pieces or be able to use something off the shelf? What constraints and flexibility are needed for your visualization? How flexible and changeable does the visualization need to be?
hth
Update: As a response to #Bhat's comment asking about lack of visualization...
The lack of a visualization tool that would allow us to effectively utilize the data stored in HBase was a major factor in re-evaluating our solution. We stored the raw data in Hive, then pre-aggregated the data and stored it in HBase. To utilize this we were going to have to write a custom connector (we did this part) and a visualization layer. We looked at what we would be able to produce and what is commercially available, and went the commercial route.
We still use Hadoop as our ETL tool for processing our weblogs; it's fantastic for that. We just send the ETL'd raw data to a commercial big data database that will take the place of both Hive and HBase in our design.
Hadoop doesn't really compare to MSSQL or other data warehouse storage. Hadoop doesn't do any storage (ignoring HDFS); it does processing of data. Running MapReduce jobs (which is what Hive does) is going to be slower than querying MSSQL (or similar).
Hadoop is very well suited for storing colossal files that can represent fact tables. These tables can be partitioned by placing the individual files that make up the table into separate directories. Hive understands such file structures and allows you to query them like partitioned tables. You can phrase your BI questions to the Hadoop data in the form of SQL queries via Hive, but you will still need to write and run an occasional MapReduce job.
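As an illustration of the directory-per-partition layout, here is a small Python sketch that builds and reads Hive-style `column=value` directory names and skips files outside the requested partition. The warehouse root, table name, and partition columns are hypothetical, and the `prune` helper only imitates what a partition-aware engine does; it is not Hive's own code.

```python
from pathlib import PurePosixPath

# Hypothetical warehouse root and table name, chosen only for illustration.
WAREHOUSE = PurePosixPath("/warehouse/page_views")

def partition_dir(root, **partition_values):
    """Build a Hive-style partition directory such as root/dt=2013-05-01/country=US."""
    path = root
    for column, value in partition_values.items():
        path = path / f"{column}={value}"
    return path

def parse_partition(path, root):
    """Recover the partition column values from a file path under root."""
    relative = PurePosixPath(path).relative_to(root)
    # Every directory component looks like "column=value"; the last part is the file name.
    return dict(part.split("=", 1) for part in relative.parts[:-1])

def prune(paths, root, **wanted):
    """Keep only files whose partition values match the requested ones."""
    return [
        p for p in paths
        if all(parse_partition(p, root).get(k) == v for k, v in wanted.items())
    ]

files = [
    str(partition_dir(WAREHOUSE, dt="2013-05-01", country="US") / "part-00000"),
    str(partition_dir(WAREHOUSE, dt="2013-05-01", country="DE") / "part-00000"),
    str(partition_dir(WAREHOUSE, dt="2013-05-02", country="US") / "part-00000"),
]
print(prune(files, WAREHOUSE, dt="2013-05-01"))
```

A query that filters on a partition column only has to read the matching directories, which is why the choice of partition columns matters for fact tables of this size.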
From a business perspective, you should consider Hadoop if you have a lot of low-value data. There are many cases where RDBMS / MPP solutions are not cost effective.
You should also consider Hadoop a serious option if your data is not structured (HTML pages, for example).
We are creating a comparison matrix for BI tools for Big Data / Hadoop:
http://hadoopilluminated.com/hadoop_book/BI_Tools_For_Hadoop.html
It is a work in progress, and we would love any input.
(Disclaimer: I am the author of this online book.)
So like most Enterprise companies, we have built a data warehouse in Hadoop, with user queries supported in Hive, and now after a few months and user acceptance testing everyone is a little surprised about how it is not like a standard (Oracle/Netezza) database when used by end-users for ad-hoc data analysis.
While I understand that this is probably a very stupid way of doing projects (we should have researched the use cases and best fit technologies before building the product), and I know the basic technical aspects of how Hadoop differs from single node machines... I would still want to understand if using Hadoop/Hive makes sense for data warehouses in any scenario?
For instance,
Are there always trade-offs in query performance or can they be optimized with configuration changes, horizontal scaling of hardware?
Can it ever be as fast as something like Netezza - which uses non-commodity hardware but functions on a similar architecture?
Where is Hadoop great and absolutely defeats everything else in comparison?
I would argue the Hive MetaStore is more useful than HiveServer2 itself as the query interface.
The MetaStore is what Presto and Spark use to get data much quicker than MapReduce, but maybe not as fast as a well-optimized Tez query, and improvements are being made in Hive v2.x+ with LLAP, for example.
In the end, Hive is really only useful if the ingestion pipelines are actually storing the data in a columnar format such as ORC or Parquet to begin with. From there, any reasonable query engine can scan through that data fairly quickly, and Hive just happens to be considered the de facto implementation of that access pattern, whereas Impala or Presto are more often used for ad hoc access.
That being said, Hive (and other SQL on Hadoop) is not used for "building", it is used for "analyzing".
And I don't know what you mean by "standard" - Hive supports any ODBC/JDBC Connection, so it's not like you go to the CLI for all access, and HUE or Zeppelin make really nice notebooks for SQL analysis over Hive.
To answer your question,
Are there always trade-offs in query performance or can they be optimized with configuration changes, horizontal scaling of hardware?
If Hive is the only Hadoop tool you are using for ad hoc querying, it is not the right choice for ad hoc querying and data analysis. You have to explore better options for your use case and select from technologies such as Hive LLAP, HBase, Spark, Spark SQL, Spark Streaming, Apache Storm, Impala, Apache Drill, Presto, etc.
Can it ever be as fast as something like Netezza - which uses non-commodity hardware but functions on a similar architecture?
It is a better tool nowadays, and most organizations are using it, but you have to be specific about which tools you select from the Hadoop stack: study your use case first, and then choose the right technology for it.
Where is Hadoop great and absolutely defeats everything else in comparison?
Hadoop is best for implementing a data lake platform in a large organization whose data is scattered across multiple systems; with a Hadoop data lake you can bring that data together in one central place. It can then be leveraged as an analytics platform for the organization's data accumulated over time. It can also be used for stream processing to get results in real time.
Hope this will help.
Well, there are many benefits of storing big data in HDFS, or rather in the Hadoop ecosystem. To name the most important ones: it can store and process huge volumes of data, and the configuration is pretty straightforward.
I can't wrap my head around the basic theoretical concept of 'Operational and Analytical Big Data'.
According to me:
Operational Big Data: Branch where we can perform read/write operations on big data using specially designed databases (NoSQL). Somewhat similar to ETL in an RDBMS.
Analytical Big Data: Branch where we analyse data in retrospect and draw predictions using techniques like MPP and MapReduce. Somewhat similar to reporting in an RDBMS.
(Please feel free to correct wherever I'm wrong, it's just my understanding.)
So according to me, Hadoop is used for Analytical Big Data, where we just process data for analysis but don't tamper with the original data, and hence it is not an ideal choice for ETL.
But recently I have come across this article which advocates using Hadoop for ETL: https://www.datanami.com/2014/09/01/five-steps-to-running-etl-on-hadoop-for-web-companies/
Hadoop (MapReduce) is not an efficient processing layer, IMO, without adequate tweaking, so out of the box, the answer is neither. Sure, MapReduce could be used, and under the hood, that API is what most higher level tools depend on, but since those other tools exist, you wouldn't want to go write ETL jobs in plain MapReduce.
You can combine Hadoop with Spark, Presto, HBase, Hive, etc. to unlock these other Operational or Analytical layers; some are useful for reporting use cases, and others are useful for ETL. Again, there are plenty of knobs to turn to get useful results in a reasonable time compared to an RDBMS (or other NoSQL tools). Plus, it takes several attempts to know how to best store data in Hadoop to begin with (hint: not plaintext, and not lots of small files).
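On the "not lots of small files" hint, one common remedy is to compact small files into fewer large ones. Below is a rough Python sketch of just the planning step: grouping files into batches under a target size. The 128 MB target and the file names are assumptions for illustration, and the actual merge (and conversion to a columnar format) would be done by whatever tool you use on the cluster.

```python
# Assumed target size; 128 MB mirrors a common HDFS block size, but treat it as a tunable.
TARGET_BYTES = 128 * 1024 * 1024

def plan_compaction(file_sizes, target=TARGET_BYTES):
    """Group small files into batches whose combined size stays at or under target.

    file_sizes maps file name -> size in bytes. Files already at or above the
    target are left alone. Returns a list of batches (lists of file names).
    """
    batches, current, current_size = [], [], 0
    for name, size in sorted(file_sizes.items()):
        if size >= target:
            continue  # big enough already, no need to merge
        if current and current_size + size > target:
            batches.append(current)  # close the batch before it overflows
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches

MB = 1024 * 1024
sizes = {"a.log": 60 * MB, "b.log": 60 * MB, "c.log": 60 * MB, "d.log": 200 * MB}
print(plan_compaction(sizes))
```

This is a simple greedy grouping in name order, not an optimal packing; it is only meant to show the shape of the problem.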
That link is over 5 years old now, and references Flume and Sqoop. Other "web scale" technologies have shown their worth in that time, while Flume and Sqoop have shown their age and can be difficult to configure and manage compared to tools like Apache NiFi.
We're strategizing on how to analyze user "interest" (clicks, likes, etc) on 1M+ items on our site to generate a "similar items" list.
In order to process a large amount of raw data we're learning about Hadoop, Hive, and related projects.
My question is regarding this concern: Hadoop/Hive and the like seem to be geared more towards data dumps, followed by processing cycles. Presumably the end of the processing cycle is something to the extent of an indexed graph of links between related items.
If I'm on track so far, how is data typically processed in these scenarios: I.e.
Is the raw user data re-analyzed at intervals to re-build an indexed graph of links?
Do we stream data as it comes in, analyze it and update the data store?
As the resultant data from the analysis changes, are we typically updating it piece by piece, or re-processing in bulk?
Is this use case better addressed by Cassandra than Hive/HDFS?
I'm looking to better understand the common approach to this kind of big data processing.
I think this is a good use case for Hadoop family of tools.
It looks to me like HDFS and Flume might be obvious choices. I would look into either HBase or Hive depending on what kinds of analysis you are interested in and how flexible you are in organizing the data and querying it.
Is the raw user data re-analyzed at intervals to re-build an indexed graph of links?
Answer: Hadoop is very good for this. I would use HBase for this, but there are other choices.
Do we stream data as it comes in, analyze it and update the data store?
Answer: Flume is good for this.
As the resultant data from the analysis changes, are we typically updating it piece by piece, or re-processing in bulk?
Answer: You have options to do both. Bulk would probably be a MapReduce job on HDFS where piece-by-piece could be managed through HBase column-family values or Hive rows. If you give more details, I could be more precise.
Is this use case better addressed by Cassandra than Hive/HDFS?
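Answer: Cassandra and HBase both follow the data model of Google's BigTable (HBase is a direct open-source implementation; Cassandra combines that model with a Dynamo-style distribution design). I think that choice depends on how you need to organize, access, analyze and update data. I can provide more guidance if needed.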
HBase is usually better for semi-structured, high R/W processing.
HDFS is generally a good choice for flexible, scalable storage of data dumps, as you call them.
Flume is applicable for moving streaming data.
I would also consider looking into Titan and HBase if you are thinking graph.
Hive would be applicable if you are interested in tabular-oriented data and using SQL-like queries.
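To illustrate the bulk versus piece-by-piece choice for a "similar items" list, here is a minimal Python sketch using simple co-occurrence counts. The scoring method (counting users who showed interest in both items) is my assumption for the example, not something the question specifies. The bulk function corresponds to a periodic batch job over the full history; the incremental one corresponds to updating a store such as HBase as events arrive.

```python
from collections import defaultdict
from itertools import combinations

def rebuild_in_bulk(events):
    """Batch approach: recompute every item-pair count from the full event history.

    events is an iterable of (user, item) pairs, e.g. clicks or likes.
    """
    items_by_user = defaultdict(set)
    for user, item in events:
        items_by_user[user].add(item)
    pair_counts = defaultdict(int)
    for items in items_by_user.values():
        for a, b in combinations(sorted(items), 2):
            pair_counts[(a, b)] += 1
    return items_by_user, pair_counts

def update_incrementally(items_by_user, pair_counts, user, item):
    """Piece-by-piece approach: apply one new event to the existing counts."""
    seen = items_by_user[user]
    if item in seen:
        return  # repeated interest in the same item adds no new pair
    for other in seen:
        pair_counts[tuple(sorted((other, item)))] += 1
    seen.add(item)

def similar_items(pair_counts, item, top_n=3):
    """Rank the items that most often co-occur with the given item."""
    scores = {}
    for (a, b), count in pair_counts.items():
        if item == a:
            scores[b] = count
        elif item == b:
            scores[a] = count
    return sorted(scores, key=lambda other: (-scores[other], other))[:top_n]

history = [("u1", "book"), ("u1", "lamp"), ("u2", "book"), ("u2", "lamp"), ("u2", "desk")]
users, counts = rebuild_in_bulk(history)
update_incrementally(users, counts, "u1", "desk")
print(similar_items(counts, "book"))
```

Both paths arrive at the same counts here; in practice the bulk rebuild is simpler to reason about, while the incremental update keeps results fresher between batch runs.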
What is the point in feeding a Hadoop cluster and using that cluster to feed data into a Vertica/InfoBright data warehouse?
All these vendors keep saying "we can connect with Hadoop", but I don't understand what the point is. What is the benefit of storing data in Hadoop and transferring it into InfoBright? Why not have the applications store directly in the Infobright/Vertica DW?
Thank you!
Why combine the solutions? Hadoop has some great capabilities (see url below). These capabilities, though, do not include allowing business users to run quick analytics. Queries that take 30 minutes to hours in Hadoop are being delivered in tens of seconds with Infobright.
BTW, your initial question did not presuppose an MPP architecture and for good reason. Infobright customers Liverail, AdSafe Media & InMobi, among others, utilize IEE with Hadoop.
If you register for an Industry White Paper http://support.infobright.com/Support/Resource-Library/Whitepapers/ you will see a view of the current marketplace where four suggested Use Cases for Hadoop are outlined. It was authored by Wayne Eckerson, Director of Research, Business Applications and Architecture Group, TechTarget, in September 2011.
1) Create an online archive.
With Hadoop, organizations don’t have to delete or ship the data to offline storage; they can keep it online indefinitely by adding commodity servers to meet storage and processing requirements. Hadoop becomes a low-cost alternative for meeting online archival requirements.
2) Feed the data warehouse.
Organizations can also use Hadoop to parse, integrate and aggregate large volumes of Web or other types of data and then ship it to the data warehouse, where both casual and power users can query and analyze the data using familiar BI tools. Here, Hadoop becomes an ETL tool for processing large volumes of Web data before it lands in the corporate data warehouse.
3) Support analytics.
The big data crowd (i.e., Internet developers) views Hadoop primarily as an analytical engine for running analytical computations against large volumes of data. To query Hadoop, analysts currently need to write programs in Java or other languages and understand MapReduce, a framework for writing distributed (or parallel) applications. The advantage here is that analysts aren’t restricted by SQL when formulating queries. SQL does not support many types of analytics, especially those that involve inter-row calculations, which are common in Web traffic analysis. The disadvantage is that Hadoop is batch-oriented and not conducive to iterative querying.
4) Run reports.
Hadoop’s batch-orientation, however, makes it suitable for executing regularly scheduled reports. Rather than running reports against summary data, organizations can now run them against raw data, guaranteeing the most accurate results.
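As an example of the inter-row calculations mentioned under use case 3, here is a short Python sketch of sessionization: each click is compared with the previous click of the same user, and a long pause starts a new session. The 30-minute threshold and the sample data are assumptions for illustration and do not come from the white paper.

```python
from datetime import datetime, timedelta

# Assumed inactivity threshold; 30 minutes is a common convention, not taken from the source.
SESSION_GAP = timedelta(minutes=30)

def sessionize(clicks, gap=SESSION_GAP):
    """Assign a session number to each click by comparing it with the previous row.

    clicks is an iterable of (user, timestamp) pairs. Returns a list of
    (user, timestamp, session_number) tuples ordered by user and time.
    """
    result = []
    previous_user, previous_time, session = None, None, 0
    for user, ts in sorted(clicks):
        if user != previous_user:
            session = 1  # first click of a new user starts session 1
        elif ts - previous_time > gap:
            session += 1  # long pause since the previous row: new session
        result.append((user, ts, session))
        previous_user, previous_time = user, ts
    return result

def t(text):
    """Parse an HH:MM string into a datetime (date part is irrelevant here)."""
    return datetime.strptime(text, "%H:%M")

clicks = [("alice", t("09:00")), ("alice", t("09:10")), ("alice", t("10:30")), ("bob", t("09:05"))]
for row in sessionize(clicks):
    print(row)
```

In a MapReduce job the sort by user and time would be handled by the shuffle, and the loop body would live in the reducer.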
There are several reasons you may want to do that
1. Cost per TB. Storage costs in Hadoop are much lower than in Vertica/Netezza/Greenplum and the like. You can keep long-term retention in Hadoop and shorter-term data in the analytics DB.
2. Data ingestion capabilities (performing transformations) are better in Hadoop.
3. Programmatic analytics (libraries like Mahout), so you can build advanced text analytics.
4. Dealing with unstructured data.
The MPP dbs provide better performance in ad-hoc queries, better dealing with structured data and connectivity to traditional BI tools (OLAP and reporting) - so basically Hadoop complements the offering of these DBs
Hadoop is more of a platform than a DB.
Think of Hadoop as a neat filesystem that supports lots of queries over different file types. With this in mind, most people dump raw data onto Hadoop and use it as a staging layer in the data pipeline, where it can chew through the data and push it to other systems like Vertica or any other. You get several advantages that can be summed up as decoupling.
So Hadoop is turning into the de facto storage platform for big data. It is simple, fault-tolerant, scales well, and it is easy to feed and to get data out of it. So most vendors are trying to push a product to companies that probably have a Hadoop installation.
What makes the joint deployment so effective for this software?
First, both platforms have a lot in common:
- Purpose-built from scratch for Big Data transformation and analytics
- Leverage MPP architecture to scale out with commodity hardware, capable of managing TBs through PBs of data
- Native HA support with low administration overhead
Hadoop is ideal for the initial exploratory data analysis, where the data is often available in HDFS and is schema-less, and batch jobs usually suffice, whereas Vertica is ideal for stylized, interactive analysis, where a known analytic method needs to be applied repeatedly to incoming batches of data.
By using Vertica’s Hadoop connector, users can easily move data between the two platforms. Also, a single analytic job can be decomposed into bits and pieces that leverage the execution power of both platforms; for instance, in a web analytics use case, the JSON data generated by web servers is initially dumped into HDFS. A map-reduce job is then invoked to convert such semi-structured data into relational tuples, with the results being loaded into Vertica for optimized storage and retrieval by subsequent analytic queries.
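The JSON-to-tuples conversion described above might look roughly like the following map step, written here in plain Python rather than as an actual MapReduce job. The column list and sample records are hypothetical; the loading step into Vertica is left out because it depends on the connector in use.

```python
import json

# Hypothetical column list for the target relational table.
COLUMNS = ("timestamp", "user_id", "url", "status")

def to_tuple(json_line, columns=COLUMNS):
    """Map step: flatten one JSON record into a fixed-width tuple.

    Missing fields become None so every row has the same shape.
    Returns None for lines that are not valid JSON objects.
    """
    try:
        record = json.loads(json_line)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None
    return tuple(record.get(column) for column in columns)

def convert(lines):
    """Run the map step over all lines and drop the unparseable ones."""
    rows = (to_tuple(line) for line in lines)
    return [row for row in rows if row is not None]

raw = [
    '{"timestamp": "2013-05-01T10:15:00", "user_id": 7, "url": "/home", "status": 200}',
    '{"timestamp": "2013-05-01T10:16:00", "url": "/cart", "status": 500, "extra": "ignored"}',
    "not json at all",
]
for row in convert(raw):
    print(row)
```

Fixing the column list up front is what turns the schema-less input into something a relational loader can accept.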
What are the key differences that make Hadoop and Vertica complement each other when addressing Big Data?
Interface and extensibility
Hadoop
Hadoop’s map-reduce programming interface is designed for developers. The platform is acclaimed for its multi-language support as well as ready-made analytic library packages supplied by a strong community.
Vertica
Vertica’s interface complies with BI industry standards (SQL, ODBC, JDBC etc). This enables both technologists and business analysts to leverage Vertica in their analytic use cases. The SDK is an alternative to the map-reduce paradigm, and often delivers higher performance.
Tool chain/Ecosystem
Hadoop
Hadoop and HDFS integrate well with many other open source tools. Its integration with existing BI tools is emerging.
Vertica
Vertica integrates with the BI tools because of its standards compliant interface. Through Vertica’s Hadoop connector, data can be exchanged in parallel between Hadoop and Vertica.
Storage management
Hadoop
Hadoop replicates data 3 times by default for HA. It segments data across the machine cluster for load balancing, but the data segmentation scheme is opaque to the end users and cannot be tweaked to optimize for the analytic jobs.
Vertica
Vertica’s columnar compression often achieves 10:1 in its compression ratio. A typical Vertica deployment replicates data once for HA, and both data replicas can attain different physical layout in order to optimize for a wider range of queries. Finally, Vertica segments data not only for load balancing, but for compression and query workload optimization as well.
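As rough arithmetic on the replication and compression figures quoted above, here is a small Python sketch. It assumes the HDFS data is stored uncompressed (in practice it can be compressed too), takes the 10:1 ratio at face value, reads "replicates data once" as two copies in total, and uses a hypothetical 100 TB of raw data. It says nothing about the cost of the disks themselves.

```python
def hadoop_footprint_tb(raw_tb, replication=3):
    """Disk needed when HDFS keeps `replication` uncompressed copies of the data."""
    return raw_tb * replication

def vertica_footprint_tb(raw_tb, compression_ratio=10, copies=2):
    """Disk needed with columnar compression and the original plus one replica."""
    return raw_tb / compression_ratio * copies

raw = 100  # hypothetical raw data size in TB
print(hadoop_footprint_tb(raw))   # 300
print(vertica_footprint_tb(raw))  # 20.0
```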
Runtime optimization
Hadoop
Because the HDFS storage management does not sort or segment data in ways that optimize for an analytic job, at job runtime the input data often needs to be resegmented across the cluster and/or sorted, incurring a large amount of network and disk I/O.
Vertica
The data layout is often optimized for the target query workload during data loading, so that a minimal amount of I/O is incurred at query runtime. As a result, Vertica is designed for real-time analytics as opposed to batch oriented data processing.
Auto tuning
Hadoop
The map-reduce programs use procedural languages (Java, Python, etc.), which give the developers fine-grained control of the analytic logic, but also require that the developers optimize the jobs carefully in their programs.
Vertica
The Vertica Database Designer provides automatic performance tuning given an input workload. Queries are specified in the declarative SQL language, and are automatically optimized by the Vertica columnar optimizer.
I'm not a Hadoop user (just a Vertica user/DBA), but I would assume the answer would be something along these lines:
- You already have a setup using Hadoop and you want to add a "Big Data" database for intensive analytical analysis.
- You want to use Hadoop for non-analytical functions and processing and a database for analysis. But it is the same data, so no need for two feeds.
To expand slightly on Arnon's answer, Hadoop has been recognized as a force that is not going away and is gaining increasing traction in organizations, many times via grassroots efforts from developers. MPP databases are good at answering questions that we know about at design time such as "How many transactions do we get per hour by country?".
Hadoop started as a platform for a new type of developer that lives somewhere between analysts and developers, one who can write code but also understands data analysis and machine learning. MPP databases (column or not) are very poor at serving this type of developer who often is analyzing unstructured data, using algorithms that require too much CPU power to run in a database or datasets which are too large. The sheer amount of CPU power required to build some models makes running these algorithms in any sort of traditional sharded DB impossible.
My personal pipeline using Hadoop typically looks like:
1. Run a number of very large global queries in Hadoop to get a basic feel for the data and the distribution of variables.
2. Use Hadoop to build a smaller dataset with just the data I am interested in.
3. Export the smaller dataset into a relational DB.
4. Run lots of small queries on the relational DB, build Excel sheets, sometimes do a little R.
Bear in mind that this workflow only works for the "analyst developer" or "data scientist". Others' mileage will vary.
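A toy version of the later stages of that pipeline, as a Python sketch: the list `raw_events` stands in for the large dataset that would live in Hadoop, and an in-memory SQLite database stands in for the relational DB. The table name, columns, and sample rows are all made up for illustration.

```python
import sqlite3

# Stand-in for a large raw dataset; in the workflow above this filtering would run on Hadoop.
raw_events = [
    {"user_id": "u1", "country": "US", "action": "purchase", "amount": 30.0},
    {"user_id": "u2", "country": "DE", "action": "view", "amount": 0.0},
    {"user_id": "u3", "country": "US", "action": "purchase", "amount": 12.5},
    {"user_id": "u4", "country": "DE", "action": "purchase", "amount": 8.0},
]

def build_subset(events):
    """Build the smaller dataset: keep only the rows and columns of interest."""
    return [
        (e["user_id"], e["country"], e["amount"])
        for e in events
        if e["action"] == "purchase"
    ]

def export_to_db(rows):
    """Load the smaller dataset into a relational database (in-memory SQLite here)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE purchases (user_id TEXT, country TEXT, amount REAL)")
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn

def revenue_by_country(conn):
    """One of many small ad hoc queries against the relational copy."""
    query = "SELECT country, SUM(amount) FROM purchases GROUP BY country ORDER BY country"
    return conn.execute(query).fetchall()

conn = export_to_db(build_subset(raw_events))
print(revenue_by_country(conn))
```

The point of the split is that the expensive scan happens once, in batch, while the many small follow-up queries hit a system built for quick answers.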
Coming back to your question: because people like me are abandoning their tools, these companies are looking for ways to remain relevant in an age where Hadoop is synonymous with big data, the coolest startups and cutting-edge technology (whether this is earned or not you may discuss amongst yourselves). Also, many Hadoop installations are an order of magnitude or more larger than an organization's MPP deployments, meaning more data is being retained for longer in Hadoop.
Massively parallel databases like Greenplum DB are excellent for handling massive amounts of structured data. Hadoop is excellent at handling even more massive amounts of unstructured data, e.g. websites.
Nowadays, a ton of interesting analytics combines both of these types of data to gain insight. Therefore it is important for these database systems to be able to integrate with Hadoop.
For example you could do text processing on the Hadoop Cluster using MapReduce until you have some scoring value per product or something. This scoring value then could be used by the database to combine it with other data that is already stored in the database or data that has been loaded into the database from other sources.
Unstructured data, by its nature, is not suitable for loading into a traditional data warehouse. Hadoop MapReduce jobs can extract structure from, for example, your log files, and the result can then be ported into your DW for analytics. Hadoop is batch-oriented and therefore not suitable for analytic query processing. So you can process your data using Hadoop to bring some structure to it, and then make it query-ready via your visualization/SQL layer.
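As a sketch of what "extracting structure from log files" can mean, here is a Python example that parses web server lines in the Common Log Format into rows with named, typed fields. The choice of log format and the sample line are assumptions; other logs would need a different pattern, and in Hadoop this function would be the body of a mapper.

```python
import re

# Pattern for the Common Log Format; other log layouts would need a different pattern.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def extract(line):
    """Turn one raw log line into a dict of typed fields, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    row = match.groupdict()
    row["status"] = int(row["status"])
    # A "-" size means no body was sent; store it as 0 bytes.
    row["size"] = 0 if row["size"] == "-" else int(row["size"])
    return row

lines = [
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326',
    "garbage line",
]
structured = [row for row in map(extract, lines) if row is not None]
print(structured)
```

Lines that do not match are dropped here; a real job would more likely count or divert them so that data quality problems stay visible.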
What is the point in feeding a Hadoop cluster and using that cluster to feed data into a Vertica/InfoBright data warehouse?
The point is that you would not want your users to fire off a query and wait minutes, sometimes hours, before an answer comes back. Hadoop cannot provide you with a real-time query response, although this is changing with the advent of Cloudera's Impala and Hortonworks's Stinger. These are real-time data processing engines over Hadoop.
Hadoop's underlying data system, HDFS, allows chunking up your data and distributing it over the nodes in your cluster. In fact, HDFS can also be replaced with a 3rd party data storage like S3. Point is: Hadoop provides both -> storage + processing. So you are welcome to use hadoop as storage engine and extract the data into your data warehouse when needed. You can also use Hadoop to create cubes and marts and store these marts in the warehouse.
However, with the advent of Stinger and Impala, the strength of these claims will eventually be erased. So keep an eye out.
I used to think that Hive was just a SQL-like programming language used to make writing MapReduce-type jobs easier (i.e., a SQL-like version of Pig/Pig Latin). I'm reading more about it now, though, and apparently it's actually a full data warehouse infrastructure.
Is one of these use cases more common? That is, is it primarily used for the data warehouse infrastructure it provides, or more for the SQL-like interface? Or are both aspects of equal utility and importance?
(I'm asking because I'm trying to figure out what parts of Hive I should focus on learning about.)
That's exactly what I used to think too. Now that I've had about a month's experience with Hive, I now find that it's a great ETL tool... for a data warehouse later down the line.
Hive doesn't compare with MDX. Hive is very row-based and doesn't allow a lot of the messier operations that SQL or MDX (Multidimensional Expression Language, common in BI tools) are masters at.
We're using Hive as an ETL tool to integrate our different flat file data sources and reduce the amount of data we have to upload to a SQL-based data warehouse.
If that data only has a half-life spanning a couple of weeks, then we can keep the size of our database relatively manageable, always able to reproduce the reports later on from Hive.
Hive doesn't support updates. In our implementation we used straight MapReduce jobs for populating the data warehouse, and Hive for making exports for further processing or importing into relational data warehouses. We also used it as an intermediary for a BI reporting tool.