Usually, data warehouses in the context of big data are managed and implemented on the basis of Hadoop-based system, like Apache Hive (right?).
On the other hand, my question regards the methodological process.
How do big data affect the design process of a data warehouse?
Is the process similar or new tasks must be considered?
Hadoop is similar in architecture to MPP data warehouses, but with some significant differences. Instead of rigidly defined by a parallel architecture, processors are loosely coupled across a Hadoop cluster and each can work on different data sources.
The data manipulation engine, data catalog, and storage engine can work independently of each other with Hadoop serving as a collection point. Also critical is that Hadoop can easily accommodate both structured and unstructured data. This makes it an ideal environment for iterative inquiry. Instead of having to define analytics outputs according to narrow constructs defined by the schema, business users can experiment to find what queries matter to them most. Relevant data can then be extracted and loaded into a data warehouse for fast queries.
The Hadoop ecosystem starts from the same aim of wanting to collect together as much interesting data as possible from different systems, but approaches it in a radically better way. With this approach, you dump all data of interest into a big data store (usually HDFS – Hadoop Distributed File System). This is often in cloud storage – cloud storage is good for the task, because it’s cheap and flexible, and because it puts the data close to cheap cloud computing power. You can still then do ETL and create a data warehouse using tools like Hive if you want, but more importantly you also still have all of the raw data available so you can also define new questions and do complex analyses over all of the raw historical data if you wish. The Hadoop toolset allows great flexibility and power of analysis, since it does big computation by splitting a task over large numbers of cheap commodity machines, letting you perform much more powerful, speculative, and rapid analyses than is possible in a traditional warehouse.
We have data (not allot at this point) that we want to transform/aggregate/pivot up to wazoo.
I had a look on the www and all the answers i am asking is pointing to hadoop for scalable,cheap to run(no SQL server machine and license),fast(if you have allot of data), programmable(not little boxes that you drag around).
There is just one problem that i keep coming up against
namely 'Use hadoop if you have more than 10gb of data'
Now we don't even have 1gb of data(at this stage) is it still viable.
My other option is SSIS. Now we do use SSIS for some of our current ETL but we don't have resources for it and putting a SQL in the cloud is just going to cost to much and don't even get me started on scalability cost and config.
thanks
Your current data volume seems to be too low for making an entry into hadoop. Enter into hadoop ecosystem only if you are dealing with huge volume of data(TB/year) and if you suspect the data volume to increase exponentially down the line.
Let me explain why I suggest against hadoop for such low volume of data.
By default hadoop stores your files into 128MB chunks of data and while processing also, it takes 128MB Chunks at a time to process(parallely). If your business requirement involves heavy CPU intensive processing, then you can decrease the input chunk size from 128MB to less. But then again by decreasing the amount of data to be processed parallely, you'll end up increasing the number of IO seaks(low level block storage). At the end you might be spending more resource on managing the tasks rather than what the actual task is taking. Hence, try avoiding distributed computing as a solution for your(low) data volume.
As #Makubex has suggested, don't use hadoop.
And SISS is a good option as it handles the data in-memory so it would perform data aggregations, data type conversions, merging, etc at a much faster rate than writing to the disk using temporary tables in stored procedures.
Hadoop is meant for large amounts of data I would suggest it only for data in terabytes. It would be way slower that SISS(which runs in-memory) for small data-sets.
Refer: When to use T-SQL or SSIS for ETL
In several sources on the internet, it's explained that HDFS is built to handle a greater amount of data than NoSQL technologies (Cassandra, for example). In general when we go further than 1TB we must start thinking Hadoop (HDFS) and not NoSQL.
Besides the architecture and the fact that HDFS supports batch processing and that most NoSQL technologies (e.g. Cassandra) perform random I/O, and besides the schema design differences, why can't NoSQL Solutions (again, for example Cassandra) handle as much data as HDFS?
Why can't we use a NoSQL technology as a Data Lake? Why should we only use them as hot storage solutions in a big data architecture?
why can't NoSQL Solutions (... for example Cassandra) handle as much data as HDFS?
HDFS has been designed to store massive amounts of data and support batch mode (OLAP) whereas Cassandra was designed for online transactional use-cases (OLTP).
The current recommendation for server density is 1TB/node for spinning disk and 3TB/node when using SSD.
In the Cassandra 3.x series, the storage engine has been rewritten to improve node density. Furthermore there are a few JIRA tickets to improve server density in the future.
There is a limit right now for server density in Cassandra because of:
repair. With an eventually consistent DB, repair is mandatory to re-sync data in case of failures. The more data you have on one server, the longer it takes to repair (more precisely to compute the Merkle tree, a binary tree of digests). But the issue of repair is mostly solved with incremental repair introduced in Cassandra 2.1
compaction. With an LSM tree data structure, any mutation results in a new write on disk so compaction is necessary to get rid of deprecated data or deleted data. The more data you have on 1 node, the longer is the compaction. There are also some solutions to address this issue, mainly the new DateTieredCompactionStrategy that has some tuning knobs to stop compacting data after a time threshold. There are few people using DateTiered compaction in production with density up to 10TB/node
node rebuild. Imagine one node crashes and is completely lost, you'll need to rebuild it by streaming data from other replicas. The higher the node density, the longer it takes to rebuild the node
load distribution. The more data you have on a node, the greater the load average (high disk I/O and high CPU usage). This will greatly impact the node latency for real time requests. Whereas a difference of 100ms is negligible for a batch scenario that takes 10h to complete, it is critical for a real time database/application subject to a tight SLA
I want to know the advantages/disadvantages of using a MySQL Cluster and using the Hadoop framework.
What is the better solution. I would like to read your opinion.
I think the advantages of using a MySQL Cluster are:
high availability
good scalability
high performance / real time data access
you can use commodity hardware
And I don't see a disadvantage! Are there any disadvantages that Hadoop do not has?
The advantages of Hadoop with Hive on top of it are:
also good scalability
you can also use commodity hardware
the ability to run in heterogenous environments
parallel computing with the MapReduce framework
Hive with HiveQL
and the disadvantage is:
no real time data access. It may takes minutes or hours to analyze the data.
So in my opinion for handling big data a MySQL cluster is the better solution. Why Hadoop is the holy grail of handling big data? What is your opinion?
Both of the above answers miss a huge differentiation between mySQL and Hadoop. mySQL requires you to store data in a certain format. It likes heavily structured data - you declare the data type of each column in a table etc. Hadoop doesn't care about this at all.
Example - if you have a billion text log files, to make analysis even possible for mySQL you'd need to parse and load the data first into a mySQL table, typeing each column along the way. With hadoop and mapreduce, you define the function that is to scan/analyze/return the data from its raw source - you don't need pre-processing ETL to get it pre-structured.
If the data is already structured and in mySQL - then (hopefully) its well structured - why export it for hadoop to analyze? If it isn't, why spend the time to ETL the data?
Hadoop is not a replacement of MySQL, so I think they have their own scenario。
Every one know hadoop is better for batch job or offline compute, but there also have many related real time product, such as hbase.
If you wanna choose a offline compute & storage arch.
I suggest hadoop not MySQL cluster for offline compute & storage, because of :
Cost : obviously, hadoop cluster is more cheap than MySQL cluster
Scalability : hadoop support more than ten thousands machine in a cluster
Ecosystem : mapreduce, hive, pig, sqoop and etc.
So you can choose hadoop as offline compute & storage and MySQL as online compute & storage, you also can learn more from lambda architecture.
The other answer is good, but doesn't really explain why hadoop is more scalable for offline data crunching than MySQL Clusters. Hadoop is more efficient for large data sets that must be distributed across many machines because it gives you full control over the sharding of data.
MySQL clusters use auto-sharding, and it's designed to randomly distribute the data so no one machine gets hit with more of the load. On the other hand, Hadoop allows you to explicitly define the data partition so that multiple data points that require simultaneous access will be on the same machine, minimizing the amount of communication among the machines necessary to get the job done. This makes Hadoop better for processing massive data sets in many cases.
The answer to this question has a good explanation of this distinction.
What is the point in feeding an Hadoop cluster and using that cluster to feed data into a Vertica/InfoBright datawarehouse ?
All thse vendor keep saying "we can connect with Hadoop", but I don't understand what's the point. What is the interest of storing in Hadoop and transfering into InfoBright ? Why not have the applications store directly in the Infobright/Vertica DW ?
Thank you !
Why combine the solutions? Hadoop has some great capabilities (see url below). These capabilities though do not include allowing business users to run quick analytics. Queries that take 30 minutes to hours in Hadoop are being delivered in 10’s of seconds with Infobright.
BTW, your initial question did not presuppose an MPP architecture and for good reason. Infobright customers Liverail, AdSafe Media & InMobi, among others, utilize IEE with Hadoop.
If you register for an Industry White Paper http://support.infobright.com/Support/Resource-Library/Whitepapers/ you will see a view of the current marketplace where four suggested Use Cases for Hadoop are outlined. It was authored by Wayne Eckerson , Director of Research, Business Applications and Architecture Group, TechTarget, in September 2011.
1) Create an online archive.
With Hadoop, organizations don’t have to delete or ship the data to offline storage; they can keep it online indefinitely by adding commodity servers to meet storage and processing requirements. Hadoop becomes a low-cost alternative for meeting online archival requirements.
2) Feed the data warehouse.
Organizations can also use Hadoop to parse, integrate and aggregate large volumes of Web or other types of data and then ship it to the data warehouse, where both casual and power users can query and analyze the data using familiar BI tools. Here, Hadoop becomes an ETL tool for processing large volumes of Web data before it lands in the corporate data warehouse.
3) Support analytics.
The big data crowd (i.e., Internet developers) views Hadoop primarily as an analytical engine for running analytical computations against large volumes of data. To query Hadoop, analysts currently need to write programs in Java or other languages and understand MapReduce, a framework for writing distributed (or parallel) applications. The advantage here is that analysts aren’t restricted by SQL when formulating queries. SQL does not support many types of analytics, especially those that involve inter-row calculations, which are common in Web traffic analysis. The disadvantage is that Hadoop is batch-oriented and not conducive to iterative querying.
4) Run reports.
Hadoop’s batch-orientation, however, makes it suitable for executing regularly scheduled reports. Rather than running reports against summary data, organizations can now run them against raw data, guaranteeing the most accurate results.
There are several reasons you may want to do that
1. Cost per TB. The storage costs in Hadoop are much cheaper than Vertica/Netezza/greenplum and the like). You can get long-term retention in Hadoop and shorter term data in the analytics DB
2. Data ingestion capabilities in hadoop (performing transformations) is better in Hadoop
3. programatic analytics (libraries like Mahout ) so you can build advanced text analytics
4. dealing with unstructured data
The MPP dbs provide better performance in ad-hoc queries, better dealing with structured data and connectivity to traditional BI tools (OLAP and reporting) - so basically Hadoop complements the offering of these DBs
Hadoop is more of a platform than a DB.
Think of Hadoop as a neat filesystem that supports lots of queries over different of file types. With this in mind, most people dump raw data onto Hadoop and use it as a staging layer in the data pipeline, where it can chew the data and push it to other systems like vertica or any other. You have several advantages that can be resumed to decoupling.
So Hadoop is turning into the facto storage platform for big data. It is simple, fault-tolerant, scales well, and it is easy to feed and to get data out of it. So most vendors are trying to push a product to companies that probably have a Hadoop installation.
What makes the joint deployment so effective for this software ?
First, both platforms have a lot in common:
Purpose-built from scratch for Big Data transformation and analytics
Leverage MPP architecture to scale out with commodity hardware,
capable of managing TBs through PBs of data
Native HA support with low administration overhead
Hadoop is ideal for the initial exploratory data analysis, where the data is often available in HDFS and is schema-less, and batch jobs usually suffice, whereas Vertica is ideal for stylized, interactive analysis, where a known analytic method needs to be applied repeatedly to incoming batches of data.
By using Vertica’s Hadoop connector, users can easily move data between the two platforms. Also, a single analytic job can be decomposed into bits and pieces that leverage the execution power of both platforms; for instance, in a web analytics use case, the JSON data generated by web servers is initially dumped into HDFS. A map-reduce job is then invoked to convert such semi-structured data into relational tuples, with the results being loaded into Vertica for optimized storage and retrieval by subsequent analytic queries.
What are the Key differences that make Hadoop and Vertica complement each other when addressing Big Data.
Interface and extensibility
Hadoop
Hadoop’s map-reduce programming interface is designed for developers.The platform is acclaimed for its multi-language support as well as ready-made analytic library packages supplied by a strong community.
Vertica
Vertica’s interface complies with BI industry standards (SQL, ODBC, JDBC etc). This enables both technologists and business analysts to leverage Vertica in their analytic use cases. The SDK is an alternative to the map-reduce paradigm, and often delivers higher performance.
Tool chain/Eco system
Hadoop
Hadoop and HDFS integrate well with many other open source tools. Its integration with existing BI tools is emerging.
Vertica
Vertica integrates with the BI tools because of its standards compliant interface. Through Vertica’s Hadoop connector, data can be exchanged in parallel between Hadoop and Vertica.
Storage management
Hadoop
Hadoop replicates data 3 times by default for HA. It segments data across the machine cluster for loading balancing, but the data segmentation scheme is opaque to the end users and cannot be tweaked to optimize for the analytic jobs.
Vertica
Vertica’s columnar compression often achieves 10:1 in its compression ratio. A typical Vertica deployment replicates data once for HA, and both data replicas can attain different physical layout in order to optimize for a wider range of queries. Finally, Vertica segments data not only for load balancing, but for compression and query workload optimization as well.
Runtime optimization
Hadoop
Because the HDFS storage management does not sort or segment data in ways that optimize for an analytic job, at job runtime the input data often needs to be resegmented across the cluster and/or sorted, incurring a large amount of network and disk I/O.
Vertica
The data layout is often optimized for the target query workload during data loading, so that a minimal amount of I/O is incurred at query runtime. As a result, Vertica is designed for real-time analytics as opposed to batch oriented data processing.
Auto tuning
Hadoop
The map-reduce programs use procedural languages (Java, python, etc), which provide the developers fine-grained control of the analytic logic, but also requires that the developers optimize the jobs carefully in their programs.
Vertica
The Vertica Database Designer provides automatic performance tuning given an input workload. Queries are specified in the declarative SQL language, and are automatically optimized by the Vertica columnar optimizer.
I'm not a Hadoop user (just a Vertica user/DBA), but I would assume the answer would be something along these lines:
-You already have a setup using Hadoop and you want to add a "Big Data" database for intensive analytical analysis.
-You want to use Hadoop for non-analytical functions and processing and a database for analysis. But it is the same data, so no need for two feeds.
To expand slightly on Arnon's answer, Hadoop has been recognized as a force that is not going away and is gaining increasing traction in organizations, many times via grassroots efforts from developers. MPP databases are good at answering questions that we know about at design time such as "How many transactions do we get per hour by country?".
Hadoop started as a platform for a new type of developer that lives somewhere between analysts and developers, one who can write code but also understands data analysis and machine learning. MPP databases (column or not) are very poor at serving this type of developer who often is analyzing unstructured data, using algorithms that require too much CPU power to run in a database or datasets which are too large. The sheer amount of CPU power required to build some models makes running these algorithms in any sort of traditional sharded DB impossible.
My personal pipeline using hadoop typically looks like:
Run a number of very large global queries in Hadoop to get a basic feel for the data and the distribution of variables.
Use Hadoop to build a smaller dataset with just the data I am interested in.
Export the smaller dataset into a relational DB.
Run lots of small queries on the relational db, build excel sheets, sometimes do a little R.
Bear in mind that this workflow only works for the "analyst developer" or "data scientist". Others mileage will vary.
Coming back to your question due to people like me abandoning their tools these companies are looking for ways to remain relevant in an age where Hadoop is synonymous with big data, the coolest startups and cutting edge technology (whether this is earned or not you may discuss amongst yourselves.) Also many Hadoop installations are an order of magnitude or more larger than an organizations MPP deployments, meaning more data is being retained for longer in Hadoop.
Massive parallel database like Greenplum DB are excellent for handling massive amounts of structured data. Hadoop is excellent at handling even more massive amounts of unstructured data, e.g. websites.
Nowadays, a ton of interesting analytics combines these both types of data to gain insight. Therefore it is important for these database systems to be able to integrate with Hadoop.
For example you could do text processing on the Hadoop Cluster using MapReduce until you have some scoring value per product or something. This scoring value then could be used by the database to combine it with other data that is already stored in the database or data that has been loaded into the database from other sources.
Unstructured data, by their nature, is not suitable for loading into your traditional data warehouse. Hadoop mapreduce jobs can extract structures out of your log files (ex) and then the same can then be ported into your DW for analytics. Hadoop is batch processing, therefore is not suitable for analytic query processing. So you can process your data using hadoop to bring some structure, and then make it query ready via your visualization/sql layer.
What is the point in feeding an Hadoop cluster and using that cluster to feed data into a Vertica/InfoBright datawarehouse ?
The point is you would not want your users to fire up a query and wait for minutes, sometimes hours before you come back with an answer. Hadoop cannot provide you with a real time query response. Although this is changing with the advent of Cloudera's Impala and Hortonworks's Stinger. These are real-time data processing engines over Hadoop.
Hadoop's underlying data system, HDFS, allows chunking up your data and distributing it over the nodes in your cluster. In fact, HDFS can also be replaced with a 3rd party data storage like S3. Point is: Hadoop provides both -> storage + processing. So you are welcome to use hadoop as storage engine and extract the data into your data warehouse when needed. You can also use Hadoop to create cubes and marts and store these marts in the warehouse.
However, with the advent of Stinger and Impala, the strength of these claims will eventually be erased. So keep an eye out.