Will Hadoop replace data warehousing?

I've heard reports that Hadoop is poised to replace data warehousing. So I was wondering if there were actual case studies done with success/failure rates or if some of the developers here had worked on a project where this was done, either totally or partially?
With the advent of "Big Data" there seems to be a lot of hype with it and I'm trying to figure out fact from fiction.
We have a huge database conversion in the works and I'm thinking this may be an alternative solution.

OK, so there are a lot of success stories out there with Big Data startups, especially in AdTech, though it's not so much that they "replace" the old, expensive proprietary systems as that they use Hadoop from day one. That, I guess, is the benefit of being a startup: no legacy systems. Advertising, although somewhat boring from the outside, is very interesting from a technical and data-science point of view. There is a huge amount of data, and the challenge is to segment users and bid for ad space more efficiently. This usually means some machine learning is involved.
It's not just AdTech though, Hadoop is used in banks for fraud detection and various other transactional analysis.
So, as my two cents on why this is happening, I'll summarise with a comparison drawn from my main experience, using HDFS with Spark and Scala, versus traditional approaches that use SAS, R & Teradata:
HDFS is a very effective way to store huge amounts of data in an easily accessible, distributed way, without the overhead of first structuring the data.
HDFS does not require custom hardware; it works on commodity hardware and is therefore cheaper per TB.
HDFS & the Hadoop ecosystem go hand in glove with dynamic and flexible cloud architectures. Google Cloud and Amazon AWS have such rich and cheap features that they completely eliminate the need for in-house data centres. There is no need to buy 20 powerful servers and hundreds of TB of storage, only to discover it's not enough, or it's too much, or it's only needed for one hour a day. Setting up a cluster with cloud services is getting easier and easier; there are even scripts out there that make it possible for those with only a little sysadmin/DevOps experience.
Hadoop and Spark, particularly when used with a high-level statically typed language like Scala (but Java 8 is also OK-ish), mean data scientists can now do things they could never do with scripting languages like R, Python and SAS. First, they can wire up their modelling code with other production systems, all in one language, all in one virtual environment. Think about all the high-velocity tools written in Scala (Kafka, Akka, Spray, Spark, Spark Streaming, GraphX, etc.) and in Java (HDFS, HBase, Cassandra); all these tools are now highly interoperable. What this means is that, for the first time in history, data analysts can reliably automate analytics and build stable products. They have the high-level functionality they need, but with the predictability and reliability of static typing, FP and unit testing. Try building a large, complicated concurrent system in Python. Try writing unit tests in R or SAS. Try compiling your code, watching the tests pass, and concluding "hey, it works! let's ship it" in a dynamically typed language.
These four points combined mean that A: storing data is now a lot cheaper, B: processing data is now a lot cheaper, and C: human-resource costs are much lower, since you no longer need several teams siloed off into analysts, modellers, engineers and developers; you can mash these skills together into hybrids and ultimately employ fewer people.
Things won't change overnight. Currently the labour market is badly short of two groups, good Big Data DevOps people and Scala engineers/developers, and their rates clearly reflect that: supply is quite low even though demand is very high. Although I still maintain that Hadoop for warehousing is much cheaper, finding talent is a big cost that is restricting the pace of transition.

Related

How to do load and performance testing of Hadoop cluster?

Are there any tools to generate an automated scenario with a predefined ramp-up of user requests (running the same map-reduce job) while monitoring specific metrics of a Hadoop cluster under load? Ideally I am looking for something like LoadRunner, but a free/open-source tool.
The tool does not have to have a cool UI but rather an ability to record and save scenarios that include a ramp up and a rendezvous point for several users (wait until other users reach some point and do some action simultaneously).
The Hadoop distribution I am going to test is the latest MapR.
Searching the internet did not bring up any good free alternatives to HP LoadRunner. If you have experience with load testing Hadoop (or MapR in particular), please share what tool you used.
Every solution you look at has both a tool quotient and a labor quotient in the total price. There are many open-source tools which take the tool cost to zero, but the labor charge is so high that your total cost to deliver will be higher than a purchase of a commercial tool with a lower labor charge. Also, many people look at performance-testing tools as load generation alone, ignoring the automated collection of monitoring data and the analysis of the results, where you can tie an increase in response times to correlated resource use at the same time. This is a laborious process, made longer when you are using decoupled tools.
As you have mentioned LoadRunner: when you are evaluating a tool, compare what is available in it against any alternative you consider. For instance, LoadRunner offers Java, C, C++ & VB interfaces, so you will find a way to exercise your map and reduce infrastructure. Compare the integrated monitoring capabilities (native/SNMP/terminal user with command line...) as well as analysis and reporting. Where capabilities do not exist, you will either need to build them or acquire them elsewhere.
You have also brought up the concept of rendezvous. You will want to be careful with its application in any tool. Unless you have a very large population, the odds of a simultaneous collision in the same area of code/action at the same time become quite small. Humans are chaotic instruments, arriving and departing independently of one another. On the other hand, if you are automating an agent which is driven by a clock tick, then rendezvous makes a lot more sense. Taking a look at your job-submission logs by IP address can provide an objective model for how many jobs are submitted simultaneously (rendezvous) versus how many are running concurrently. I audit a lot of tests, and rendezvous is the most abused item across tools, resulting in thousands of lost engineering hours chasing engineering ghosts that would never occur in natural use.
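As a sketch of that log-analysis idea, here is a minimal Python example; the log format, IPs and timestamps are entirely made up for illustration, and real job-tracker logs will need their own parsing. It separates near-simultaneous submissions (rendezvous-like) from merely concurrent jobs:

```python
from datetime import datetime, timedelta

# Hypothetical submission log: (client_ip, submit_time, finish_time).
# In practice you would extract these fields from your job tracker logs.
jobs = [
    ("10.0.0.1", "12:00:00", "12:05:00"),
    ("10.0.0.2", "12:00:01", "12:04:00"),
    ("10.0.0.3", "12:02:30", "12:06:00"),
    ("10.0.0.4", "12:02:31", "12:03:00"),
]

def to_dt(s):
    return datetime.strptime(s, "%H:%M:%S")

# "Simultaneous" (rendezvous-like): submissions within a 1-second window.
window = timedelta(seconds=1)
submits = sorted(to_dt(s) for _, s, _ in jobs)
simultaneous = sum(1 for a, b in zip(submits, submits[1:]) if b - a <= window)

# "Concurrent": jobs whose run intervals overlap at some instant.
intervals = [(to_dt(s), to_dt(e)) for _, s, e in jobs]
peak = max(sum(1 for s, e in intervals if s <= t < e) for t, _ in intervals)

print(simultaneous)  # 2 near-simultaneous submission pairs
print(peak)          # 4 jobs running concurrently at the peak
```

In this toy log only two submission pairs land within a second of each other, yet all four jobs end up running concurrently, which is exactly the distinction the answer is drawing.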

Hadoop versus Supercomputer

I am not able to understand the real essence of Hadoop.
If I have the enough resources to buy a supercomputer that can process petabytes of data, then why would I need a Hadoop infrastructure to manage such huge data?
The whole point of Hadoop is to be able to process huge amounts of data on commodity, heterogeneous machines. Nothing about that rules out the use of supercomputers.
Having enough resources often makes us dumb. Let me give you an example (don't worry, it involves Hadoop) which will make it clear. The cost of Cray's cheapest supercomputer, the XC30-AC, is about $500,000 (IIRC). And what is the cost of a decent computer with decent RAM, CPU and disk? How much would it cost to buy a bunch of them and use their power collectively? How much space and resources do you need to house and handle those machines? And how difficult is it to find folks with decent programming skills to write MR jobs for you?
These are just a few things. Hadoop is open source. Use it and tweak it as you wish. Get awesome support through the mailing list for free. Not only support, but suggestions as well. I hope you get the point.
Utilizing your resources wisely is more important than just having them.
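To make that cost comparison concrete, here is a back-of-the-envelope sketch in Python. The $500k figure comes from the answer above; the commodity-node price is purely an assumption for illustration:

```python
# Back-of-the-envelope comparison with assumed figures:
# an entry-level supercomputer vs. a fleet of commodity nodes.
super_cost = 500_000   # ballpark Cray XC30-AC price quoted above
node_cost = 3_000      # assumed price of a decent commodity server

nodes_for_same_budget = super_cost // node_cost
print(nodes_for_same_budget)  # 166 commodity nodes for one XC30-AC
```

Whether 166 commodity nodes running Hadoop beat one supercomputer depends entirely on the workload, but the raw purchasing arithmetic is the point the answer is making.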

Data Mining Library for MPI

Is there any data-mining library which uses (or can be used with) MPI (Message Passing Interface)? I am looking for something similar to Apache Mahout, but which can easily be integrated into an MPI environment.
The reason why I want to use MPI is that the configuration (compared to Hadoop) is easy.
Or does it not make sense to use MPI in a Data Mining scenario?
There is no reason why MPI (which is a concept, not a piece of software!) is necessarily easier to install than Hadoop/Mahout. Indeed, the latter two are currently a mess, in particular because of their Java library chaos. Apache Bigtop tries to make them easier to install, and once you've figured out some basics it's quite OK.
However:
If your data is small (i.e. it can be processed on a single node), don't install a cluster solution, you pay for the overhead. Hadoop does not make much sense on single hosts. Use Weka, ELKI, RapidMiner, KNIME or whatever.
If your data is large, you will want to minimize data transfer. And this is where the strength of Hadoop/Mahout lies, minimizing data transfer. A typical message passing API cannot scale the same way for data-heavy operations.
There are some efforts such as Apache Hama that are quite similar to MPI stuff IMHO. It is based on messages, however they are bulk-processed via barrier synchronization. It might also have some message aggregation prior to sending to reduce traffic.
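The point about minimizing data transfer can be illustrated with a toy, single-process Python simulation of map-side combining; the "nodes" and word lists are invented for illustration, and a real cluster would of course run the map step where each partition physically lives:

```python
from collections import Counter

# Each list stands for the records stored locally on one node.
partitions = [
    ["spark", "hadoop", "spark"],          # data local to node 1
    ["hadoop", "mpi", "hadoop", "spark"],  # data local to node 2
]

# Map + local combine: each node reduces its own partition to small counts.
local_counts = [Counter(p) for p in partitions]

# Shuffle/reduce: only the (word, count) pairs cross the "network".
total = Counter()
for c in local_counts:
    total.update(c)

raw_records = sum(len(p) for p in partitions)  # records that stayed local
shipped = sum(len(c) for c in local_counts)    # pairs actually transferred
print(raw_records, shipped)  # 7 2  -> no: prints 7 5
```

Here 7 raw records are reduced locally to 5 (word, count) pairs before anything is shipped; with realistic data volumes and skew, that reduction is the difference between a feasible job and a saturated network, which is exactly the strength a plain message-passing API does not give you for free.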
I strongly recommend GraphLab. Currently GraphLab, a distributed graph-parallel API, has toolkits including:
topic modeling
collaborative filtering
clustering
graphical model
http://docs.graphlab.org/toolkits.html
GraphLab is a graph-based, high performance, distributed computation framework written in C++. While GraphLab was originally developed for Machine Learning tasks, it has found great success at a broad range of other data-mining tasks; out-performing other abstractions by orders of magnitude.
GraphLab Features:
A unified multicore and distributed API: write once run efficiently in both shared and distributed memory systems
Tuned for performance: optimized C++ execution engine leverages extensive multi-threading and asynchronous IO
Scalable: GraphLab intelligently places data and computation using sophisticated new algorithms
HDFS Integration: Access your data directly from HDFS
Powerful Machine Learning Toolkits: Turn BigData into actionable knowledge with ease
This idea doesn't make sense, and I think you have some misconceptions. MPI is more for tightly coupled systems, and I'm fairly sure it won't send messages to an external location; you can, however, process or analyze the data with MPI much more quickly (depending on your hardware). My two cents is that you are better off using one of the open-source implementations of the AMQP protocol (I would say ZeroMQ is your best bet) and then processing all the data you get in R or Python, or, if your data set is very large, MPI. Another option is to call serial libraries on different machines connected and running MPI, given they are all connected to the internet separately. R is really easy to call with MPI, and so is Python.

Oracle setup required for heavy-ish load

I am trying to make a comparison between a system setup using Hadoop and HBase and one achieving the same using Oracle DB as the back end. I lack knowledge on the Oracle side of things, so it is hard for me to come to a fair comparison.
The work load and non-functional requirements are roughly this:
A) 12M transactions on two tables with one simple relation and multiple (non-text) indexes within 4 hours. That amounts to 833 transactions per second (TPS), sustained. This needs to be done every 8 hours.
B) Make sure that all writes are durable (so a running transaction survives a machine failure in a clustered setup) and have a decent level of availability. By a decent level of availability, I mean that regular failures, such as a disk failure or a single network interface / TCP connection drop, should not require human intervention. Rare failures may require intervention, but should be solved by just firing up a cold standby that can take over quickly.
C) Additionally add another 300 TPS, but have these happen almost continuously 24/7 across many tables (but all in pairs of two with the same simple relation and multiple indexes)?
Some context: this workload is 24/7 and the system needs to hold 10 years worth of historical data available for live querying. Query performance can be a bit worse than sub-second, but must be lively enough to consider for day-to-day usage. The ETL jobs are setup in such a way that there is little churn. Also in a relational setup, this workload would lead to little lock contention. I would expect index updates to be the major pain. To make a comparison as fair as possible, I would expect the loosest consistency level that Oracle provides.
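As a quick sanity check on the stated figures (using only the numbers given above):

```python
# Requirement A: 12M transactions within a 4-hour window.
batch_tx = 12_000_000
batch_window_s = 4 * 3600
batch_tps = batch_tx / batch_window_s
print(round(batch_tps))   # 833 TPS sustained during the window

# Requirement C adds a continuous background load on top.
continuous_tps = 300
peak_tps = batch_tps + continuous_tps
print(round(peak_tps))    # ~1133 TPS when the two loads overlap
```

So whatever back end is chosen has to hold roughly 1,100+ sustained TPS during the batch window, not just the headline 833.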
I have no intention of bashing Oracle. I think it is a great database for many uses. I am trying to get a feeling for the tradeoff there is between going open source (and NoSQL) like we do and using a commercially supported, proven setup.
Nobody can answer this definitively.
When you go buy a car you can sensibly expect that its top speed, acceleration and fuel consumption will be within a few percent of values from independent testing. The same does not apply to software in general nor to databases in particular.
Even if you had provided exact details of the hardware, OS and data structures, along with full details of the amount of data stored as well as transactions, the performance could easily vary by a factor of 100 times depending on the pattern of usage (due to development of hot spots of record caching, disk fragmentation).
However, having said that there are a few pointers I can give:
1) invariably a NoSQL database will outperform a conventional DBMS; the raison d'être of NoSQL databases is performance and parallelization. That does not mean that conventional DBMSs are redundant, as they provide much greater flexibility for interacting with data
2) for small to mid range data volumes, Oracle is relatively slow in my experience compared with other relational databases. I'm not overly impressed with Oracle RAC as a scalable solution either.
3) I suspect that the workload would require a mid-range server for consistent results (something in the region of $8k+) running Oracle
4) While having a hot standby is a quick way to cover all sorts of outages, in a lot of cases, the risk/cost/benefit favours approaches such as RAID, multiple network cards, UPS rather than the problems of maintaining a synchronized cluster.
5) Support - have you ever bothered to ask the developers of an open source software package if they'll provide paid for support? IME, the SLAs / EULAs for commercial software are more about protecting the vendor than the customer.
So if you think it's worthwhile considering, and cost is not a big issue, then the best answer would be to try it out for yourself.
No offense here, but if you have little Oracle knowledge there is really no way you can do a fair comparison. I've worked with teams of very experienced Oracle DBAs and sys admins who would argue about setups for comparison tests (the hardware/software setup variables are almost infinite). Usually these tests were justifications for foregone conclusions about infrastructure direction (money being a key issue as well).
Also, do you plan on hiring a team of Hadoop experts to manage your company's data infrastructure? Oracle isn't cheap, but you can find very seasoned Oracle professionals (from DBAs to developers to analysts), not too sure about hadoop admins/dbas...
Just food for thought (and no, I don't work for Oracle ;)

Recommendation for a large-scale data warehousing system

I have a large amount of data I need to store, and be able to generate reports on - each one representing an event on a website (we're talking over 50 per second, so clearly older data will need to be aggregated).
I'm evaluating approaches to implementing this, obviously it needs to be reliable, and should be as easy to scale as possible. It should also be possible to generate reports from the data in a flexible and efficient way.
I'm hoping that some SOers have experience of such software and can make a recommendation, and/or point out the pitfalls.
Ideally I'd like to deploy this on EC2.
Wow. You are opening up a huge topic.
A few things right off the top of my head...
think carefully about your schema for inserts in the transactional part and reads in the reporting part, you may be best off keeping them separate if you have really large data volumes
look carefully at the latency that you can tolerate between real-time reporting on your transactions and aggregated reporting on your historical data. Maybe you should have a process which runs periodically and aggregates your transactions.
look carefully at any requirement which sees you reporting across your transactional and aggregated data, either in the same report or as a drill-down from one to the other
prototype with some meaningful queries and some realistic data volumes
get yourself a real production-quality, enterprise-ready database, e.g. Oracle / MSSQL
think about using someone else's code/product for the reporting e.g. Crystal/BO / Cognos
as I say, huge topic. As I think of more I'll continue adding to my list.
HTH and good luck
Simon made a lot of excellent points; I'll just add a few and re-iterate/emphasize some others:
Use the right datatype for the Timestamps - make sure the DBMS has the appropriate precision.
Consider queueing for the capture of events, allowing for multiple threads/processes to handle the actual storage of the events.
Separate the schemas for your transactional and data warehouse
Seriously consider a periodic ETL from transactional db to the data warehouse.
Remember that you probably won't have 50 transactions/second 24x7x365 - peak transactions vs. average transactions
Investigate partitioning tables in the DBMS. Oracle and MSSQL will both partition on a value (like date/time).
Have an archiving/data retention policy from the outset. Too many projects just start recording data with no plans in place to remove/archive it.
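A minimal sketch of the queueing suggestion above, using only Python's standard library; the batch size, event shape and sentinel convention are arbitrary choices for illustration, not a prescribed design:

```python
import queue
import threading

# Capture events on a queue; a writer thread batches them into the store,
# decoupling event capture from the cost of the actual storage writes.
events = queue.Queue()
stored_batches = []

def writer(batch_size=10):
    batch = []
    while True:
        item = events.get()
        if item is None:                  # sentinel: flush and stop
            if batch:
                stored_batches.append(batch)
            break
        batch.append(item)
        if len(batch) >= batch_size:
            stored_batches.append(batch)  # one bulk insert per batch
            batch = []

t = threading.Thread(target=writer)
t.start()
for i in range(25):                       # simulate 25 incoming events
    events.put({"event_id": i})
events.put(None)
t.join()

print(len(stored_batches))                # 3 batches: 10 + 10 + 5 events
```

The same shape works with multiple capture threads feeding one queue, and turning single-row inserts into bulk inserts is usually the cheapest way to survive the peak-versus-average gap mentioned above.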
I'm surprised none of the answers here cover Hadoop and HDFS. I would suggest that is because SO is a programmers' Q&A and your question is in fact a data-science question.
If you're dealing with a large number of queries and long processing times, you could use HDFS (a distributed storage layer on EC2) to store your data and run batch queries (i.e. analytics) on commodity hardware.
You would then provision as many EC2 instances as needed (hundreds or thousands, depending on how big your data-crunching requirements are) and run map-reduce queries against your data to produce reports.
Wow.. This is a huge topic.
Let me begin with databases. First get something good if you are going to have crazy amounts to data. I like Oracle and Teradata.
Second, there is a definitive difference between recording transactional data and reporting/analytics. Put your transactional data in one area and then roll it up on a regular schedule into a reporting area (schema).
I believe you can approach this two ways
Throw money at the problem: Buy best in class software (databases, reporting software) and hire a few slick tech people to help
Take the homegrown approach: build only what you need right now and grow the whole thing organically. Start with a simple database and build a web reporting framework. There are a lot of decent open-source tools and inexpensive agencies that do this work.
As far as the EC2 approach goes, I'm not sure how it would fit into a data-storage strategy. EC2's strength is processing, but your primary goal is efficient storage and retrieval.