Exasol vs HBase - Hadoop

I'm quite new to Big Data architecture, so please don't be too harsh on me.
I am trying to figure out the best alternative for building a BI architecture able to deal with huge amounts of data. As I see it, the solution has to be clustered/horizontally scalable to cope with system growth. I would like to be able to interact with the system using SQL, so HBase + Hive (or even Pig, not for SQL but to avoid writing MapReduce jobs by hand) could be a solution. What would be the benefits/disadvantages of such an architecture compared to, for instance, EXASolution and its in-memory, MPP, columnar solution?
Are there other alternatives which might have some extra benefits? What about maintenance and configuration? Is there any Microsoft solution? (I may find customer-specific needs regarding this.)
Sorry for posting such an open question, but I would like to see some discussion so that I can learn from you as much as possible.

Being an EXASOL guy, I will nevertheless not try to convince you that EXASOL is the one and only good solution out there. It heavily depends on the use case you are trying to implement and the requirements you have to fulfill.
Hadoop is a very flexible, scalable system and used very often for storing and processing huge volumes of data.
EXASOL in contrast is a specialized RDBMS for complex analytic query processing.
I think that these two options don't really compete directly but complement each other. In many cases companies need a scalable data lake to store and preprocess their data, or to query it in rather simple ways. Once you want to enter the real-time business with complex analytics, where dozens, hundreds or even thousands of analysts are running lots of queries, then an in-memory RDBMS is a great choice.
King, the producer of Candy Crush, combines these two worlds into a powerful data management ecosystem. They store petabytes of data within Hadoop and use EXASOL on top as an in-memory layer for hundreds of terabytes of data. You can read more about that exciting use case here: http://bit.ly/1TR8APY
Another important difference between these two worlds is complexity. While EXASOL is tuning-free because it is a specialized system (similar to an appliance) for a certain use case (running SQL queries or R/Python/Java in-database analytics), the Hadoop stack is much more complex. You'll need a certain level of know-how to set up, maintain and tune this system. This alone doesn't decide in favor of either option. As mentioned, it heavily depends on what you want.
From a price perspective, Hadoop is free, so it should be much cheaper than an in-memory DB such as EXASOL, right? Wait a minute, it's not that easy. Again, you have to consider the whole picture: how much data you really want to store, how much of it needs to be queried for analysis, how much hardware you would need to buy, and how many people have to be hired and trained to operate the system or build the analytics deployed on it.
Summary
To summarize my thoughts, the world is too complicated to directly compare these two technologies. Depending on the use case and your personal requirements, either one or the other could be the better option. And in my opinion, the trend in the market is combining such systems into a data management ecosystem where you get the best of both worlds... Actually three worlds, because the world of operational data processing in NoSQL solutions should also be mentioned here.
I hope that helped a bit. If you need any further details especially about EXASOL, don't hesitate to contact me or connect with me on LinkedIn: de.linkedin.com/in/exagolo

Related

Modern Business Intelligence solution

What is the modern way of building a Business Intelligence solution? I have looked at Power BI, but I'm wondering what would be the best data source for it. Is it still traditional data warehouse solutions that should be used as a data source? I also hear a lot of talk about data lakes, but don't know much about them. Or should I just use a regular relational database as the source? Does anyone have any opinions and tips on this?
I think the starting point of your thinking is wrong. You don't choose a front-end BI / dashboard tool first and then think about which source would be best to connect to it.
You start from the data & information that you want to analyze, report on & visualize. Think of the structure & variety of the data, and the complexity of the analysis, correlations, integrations & business logic.
Then decide how you are going to:
1. Store the data
2. Process / transform the data to correlate, integrate or enrich it
3. Report on or visualize the data
And it's only at step 3 of the high-level tasks above that you start thinking about which analysis / visualization tool is the best fit for your data and its integration with the data storage platform you have, as well as the nature of the data itself.
That will most likely bring you more success than approaching it the way you posed the question.
I hope it helps.
Start with your data.
1. Do you have a data warehouse now? If not:
2. Where is your data - databases, Excel, email? Data in databases, like MySQL, is structured. Data in email or other documents is unstructured. Where your data lives impacts how you will analyze it (which is what BI is all about, in the end). (And a side note: data lakes are best for analyzing structured, semi-structured and unstructured data together - for example, if you queried data in documentation, a SQL DB and older MS Access data dumps.)
3. If you have data in different databases and systems, then I would recommend you start with a data warehouse. There are many options; one of the easier ones today is using a cloud-based solution (AWS, Microsoft, etc.).
4. Once your data is in a location (or locations) where it can be queried and analyzed as a total data set, you can look at the BI tools that fit your needs.
4.a. What type of analysis do you need? Queries? Trends? Complex data calculations and transformations?
5. Based on 4.a., look at the tools in the market. Power BI is just one of a whole variety of data analysis tools and systems on the market. There are many resources on the web; Google "ETL tools".
6. After all of this you can narrow down your choices and select the solution that works best for you.

How does CouchDB 1.6 inherently take advantage of MapReduce when it is a single-server database?

I am new to CouchDB. While going through the documentation of CouchDB 1.6, I came to know that it is a single-server DB, so I was wondering how it inherently takes advantage of MapReduce.
If I need to scale this DB, do I need to add more RAID hardware, or will it work on commodity hardware like HDFS?
I came to know that CouchDB 2.0 is planning to bring a clustering feature, but I could not find proper documentation on this.
Can you please help me understand how exactly files get stored and accessed internally?
Really appreciate your help.
I think your question is something like this:
"MapReduce is … a parallel, distributed algorithm on a cluster." [shortened from MapReduce article on Wikipedia]
But CouchDB 1.x is not a clustered database.
So what does CouchDB mean by using the term "map reduce"?
This is a reasonable question.
The historical use of "MapReduce", as described by Google in this paper using that stylized term and implemented in Hadoop using that same styling, implies parallel processing over a dataset that may be too large for a single machine to handle.
But that's not how CouchDB 1.x works. View index "map" and "reduce" processing happens not just on a single machine, but even on a single thread! As dch (a longtime contributor to the core CouchDB project) explains in his answer at https://stackoverflow.com/a/12725497/179583:
The issue is that eventually, something has to operate in serial to build the B-tree in such a way that range queries across the indexed view are efficient. … It does seem totally wacko the first time you realise that the highly parallelisable map-reduce algorithm is being operated sequentially, wat!
So: what benefit does map/reduce bring to single-server CouchDB? Why were CouchDB 1.x view indexes built around it?
The benefit is that the two functions a developer can provide for each index ("map", and optionally "reduce") form very simple building blocks that are easy to reason about, at least once your indexes are designed.
What I mean is this:
With e.g. the SQL query language, you focus on what data you need, not on how much work it takes to find it. So you might have unexpected performance problems that may or may not be solved by figuring out the right columns to add indexes on, etc.
With CouchDB, the so-called NoSQL approach is taken to an extreme. You have to think explicitly about how each document or set of documents "should be" found. You say: I want to be able to find all the "employee" documents whose "supervisor" field matches a certain identifier. So now you have to write a map function:
function (doc) {
  if (doc.isEmployeeRecord) emit(doc.supervisor.identifier);
}
And then you have to query it like:
GET http://couchdb.local:5984/personnel/_design/my_indexes/_view/by_supervisor?key=SOME_UUID
In SQL you might simply say something like:
SELECT * FROM personnel WHERE supervisor = ?
So what's the advantage to the CouchDB way? Well, in the SQL case this query could be slow if you don't have an index on the supervisor column. In the CouchDB case, you can't really make an unoptimized query by accident — you always have to figure out a custom view first!
(The "reduce" function that you provide to a CouchDB view is usually used for aggregate functions purposes, like counting or averaging across multiple documents.)
If you think this is a dubious advantage, you are not alone. Personally I found designing my own indexes via a custom "map function" and sometimes a "reduce function" to be an interesting challenge, and it did pay off in knowing the scaling costs at least of queries (not so much for replications…).
So don't think of a CouchDB view so much as being "MapReduce" (in the stylized sense) but just as providing efficiently-accessible storage for the results of running [].map(…).reduce(…) across a set of data. Because the "map" function is applied to only one document at a time, the total set of data can be bigger than fits in memory at once. Because the "reduce" function's output must stay small, it further encourages efficient processing of a large set of data into an efficiently-accessed index.
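As a rough, non-CouchDB illustration of that idea (plain JavaScript with made-up documents), a view is essentially a persisted, incrementally maintained version of this:

const docs = [
  { isEmployeeRecord: true,  supervisor: { identifier: "A" } },
  { isEmployeeRecord: true,  supervisor: { identifier: "A" } },
  { isEmployeeRecord: false, title: "Vendor contract" },
  { isEmployeeRecord: true,  supervisor: { identifier: "B" } },
];

// "map": look at one document at a time and emit a key per match
const keys = docs
  .filter(d => d.isEmployeeRecord)
  .map(d => d.supervisor.identifier);

// "reduce": collapse the emitted keys into a small aggregate
const countsBySupervisor = keys.reduce((acc, key) => {
  acc[key] = (acc[key] || 0) + 1;
  return acc;
}, {});

console.log(countsBySupervisor); // { A: 2, B: 1 }

The difference is that CouchDB stores the mapped rows in a B-tree and only re-processes changed documents, so the work is not redone on every query.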
If you want to learn a bit more about how the indexes generated in CouchDB are stored, you might find these articles interesting:
The Power of B-trees
CouchDB's File Format is brilliantly simple and speed-efficient (at the cost of disk space).
Technical Details, View Indexes
You may have noticed, and I am sorry, that I do not actually have a clear/solid answer as to what the actual advantages and reasons were! I did not design or implement CouchDB; I was only an avid user for many years.
Maybe the bigger advantage is that, in systems like Couchbase and CouchDB 2.x, the "parallel friendliness" of the map/reduce idea may come into play more. So if you have designed an app to work with CouchDB 1.x, it may then scale in the newer versions without further intervention on your part.

Is a data warehouse a good solution for sharing customer data across technologies?

I want to be able to share data across all areas of our business in a way that reduces the overall complexity of our infrastructure.
The Problem
Our problem is that we currently have 4 main applications that all connect to our CRM application (Microsoft Dynamics 2011).
The decision-makers at our firm are currently wanting to upgrade our CRM to the most current version and, then, stay up to date as new upgrades are released (every 2-3 years). Almost all of our applications are rigidly integrated with Microsoft Dynamics so each upgrade is very expensive and risky. I want to design another approach that will reduce this expense and risk.
Research
In 2006, Roger Sessions wrote an article called A Better Path to Enterprise Architectures (here) which outlines ways to build better business IT systems. One of the central themes in his discussion is reducing complexity: by arranging dice in different ways, he shows that you can exponentially reduce the complexity of systems by partitioning technologies into segments rather than letting any technology connect to any other technology. Jeanne Ross has a great presentation on this topic as well (here); she talks about having a digitized platform for sharing core data and services between areas of the business in order to reduce the complexity of the overall system and increase agility in responding to current and future business needs.
Conclusions
As I reflect on the lessons from Sessions and Ross, I am confident that we need to take Microsoft Dynamics out of the center of our architecture if we want to overhaul the technology every 2-3 years. We'll just need to replace it with something that will allow our core data (mostly customer data) to be shared across applications. I know that data warehouses are often used for aggregating data across the organization. Could this work?
I understand that data warehouses are mostly used for reporting, so I don't know if having direct connections to the data warehouse would be ideal. However, each application would not need the ability to update any data in the data warehouse. They would just need the ability to grab their IDs to set up relationships between the global, data-warehouse entities (customers) and the various unit-specific entities within each application's database.
Questions
Which of these three options would meet my needs: (1) a data warehouse to which all applications connect directly, (2) a data warehouse that feeds data to each application-specific database through overnight updates or (3) something else?
Thanks
What you're after is a data integration architecture - that doesn't necessarily mean a data warehouse. The pattern you're describing is called "hub and spoke," and it's very common - I'd say you're definitely on the right track for resolving the integration problem you're describing.
This page goes into this problem and pattern in much more depth, and it also has a section on the differences between data warehousing and data integration. You've noted that you're aware data warehouses are commonly used for reporting - that's true, and they're also used heavily for analytics, as the link discusses. They're traditionally a data source for business intelligence efforts. This can mean they're not always focused on the kind of data you're interested in - i.e. operational data which your systems need to function, but which might not be of interest for reporting or analytical purposes. Or, they might not function in a way that's helpful for your needs - for instance, traditional overnight ETL loads might not be the best solution if you need your applications to be up-to-date more quickly.
All this is to say that data warehouses can definitely be used as a data hub - the EDW becomes your "master data" source, any data quality processes needed run on the EDW data, and ETL processes fire corrected data back out to the various sources - but you'll probably be better served by researching the topic of data integration than the topic of data warehousing, even if the two share a lot of similarities and can overlap.
If you create a data warehouse without any business intelligence requirements, it might not function very well as a data warehouse. A very suitable data integration/master data solution might not resolve all of the future requirements you have for a data warehouse. Equally, if you were to create a traditional data warehouse after researching data warehousing best practices, it might not fulfill your data integration requirements, or fulfill them in the best way. As the link suggests, separate the two ideas: resolve your data integration problem, and if you want a data warehouse in the future, you can use your data integration solution to help populate it.

What is the difference between Big Data and Data Mining? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 6 years ago.
As Wikipedia states:
The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use
How is this related with Big Data? Is it correct if I say that Hadoop is doing data mining in a parallel manner?
Big data is everything
Big data is a marketing term, not a technical term. Everything is big data these days. My USB stick is a "personal cloud" now, and my hard drive is big data. Seriously. This is a totally unspecific term that is largely defined by what the marketing departments of various very optimistic companies can sell - and the C*Os of major companies buy, in order to make magic happen. Update: and by now, the same applies to data science. It's just marketing.
Data mining is the old big data
Actually, data mining was just as overused... it could mean anything such as
collecting data (think NSA)
storing data
machine learning / AI (which predates the term data mining)
non-ML data mining (as in "knowledge discovery", where the term data mining was actually coined; but where the focus is on new knowledge, not on learning of existing knowledge)
business rules and analytics
visualization
anything involving data you want to sell for truckloads of money
It's just that marketing needed a new term. "Business intelligence", "business analytics", ... they still keep on selling the same stuff, it's just rebranded as "big data" now.
Most "big" data mining isn't big
Since most methods - at least those that give interesting results - just don't scale, most of the data being "mined" isn't actually big. It's clearly much bigger than 10 years ago, but not big as in exabytes. A survey by KDnuggets had something like 1-10 GB as the average "largest data set analyzed". That is not big data by any data management standard; it is only large relative to what can be analyzed using complex methods. (I'm not talking about trivial algorithms such as k-means.)
Most "big data" isn't data mining
Now "Big data" is real. Google has Big data, and CERN also has big data. Most others probably don't. Data starts being big, when you need 1000 computers just to store it.
Big data technologies such as Hadoop are also real. They aren't always used sensibly (don't bother running Hadoop clusters of fewer than 100 nodes - at that point you can probably get much better performance from well-chosen non-clustered machines), but of course people write such software.
But most of what is being done isn't data mining. It's Extract, Transform, Load (ETL), so it is replacing data warehousing. Instead of using a database with structure, indexes and accelerated queries, the data is just dumped into Hadoop, and once you have figured out what to do, you re-read all your data, extract the information you really need, transform it, and load it into your Excel spreadsheet. Because after selection, extraction and transformation, it usually isn't "big" anymore.
Data quality suffers with size
Many of the marketing promises of big data will not hold. Twitter produces far fewer insights for most companies than advertised (unless you are a teenie rockstar, that is), and the Twitter user base is heavily biased. Correcting for such a bias is hard, and it needs highly experienced statisticians.
Bias in the data is one problem: if you just collect some random data from the internet or an application, it will usually not be representative, in particular not of potential users. Instead, you will be overfitting to the existing heavy users if you don't manage to cancel out these effects.
The other big problem is just noise. You have spam bots, but also other tools (think of Twitter "trending topics" causing reinforcement of "trends") that make the data much noisier than other sources. Cleaning this data is hard, and it is not a matter of technology but of statistical domain expertise. For example, Google Flu Trends was repeatedly found to be rather inaccurate. It worked in some of the earlier years (maybe because of overfitting?) but is no longer of good quality.
Unfortunately, a lot of big data users pay too little attention to this; which is probably one of the many reasons why most big data projects seem to fail (the others being incompetent management, inflated and unrealistic expectations, and lack of company culture and skilled people).
Hadoop != data mining
Now for the second part of your question. Hadoop doesn't do data mining. Hadoop manages data storage (via HDFS, a very primitive kind of distributed database) and it schedules computation tasks, allowing you to run the computation on the same machines that store the data. It does not do any complex analysis.
There are some tools that try to bring data mining to Hadoop. In particular, Apache Mahout can be called the official Apache attempt to do data mining on Hadoop. Except that it is mostly a machine learning tool (machine learning != data mining; data mining sometimes uses methods from machine learning). Some parts of Mahout (such as clustering) are far from advanced. The problem is that Hadoop is good for linear problems, but most data mining isn't linear. And non-linear algorithms don't just scale up to large data; you need to carefully develop linear-time approximations and live with losses in accuracy - losses that must be smaller than what you would lose by simply working on smaller data.
A good example of this trade-off problem is k-means. K-means actually is a (mostly) linear problem; so it can be somewhat run on Hadoop. A single iteration is linear, and if you had a good implementation, it would scale well to big data. However, the number of iterations until convergence also grows with data set size, and thus it isn't really linear. However, as this is a statistical method to find "means", the results actually do not improve much with data set size. So while you can run k-means on big data, it does not make a whole lot of sense - you could just take a sample of your data, run a highly-efficient single-node version of k-means, and the results will be just as good. Because the extra data just gives you some extra digits of precision of a value that you do not need to be that precise.
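To make the sampling argument concrete, here is a minimal single-node sketch in plain JavaScript (random made-up data, no convergence check, not a tuned implementation): draw a sample from the full data set and run a basic k-means on the sample only.

function sampleFrom(points, n) {
  // sample with replacement; good enough for an illustration
  const out = [];
  for (let i = 0; i < n; i++) {
    out.push(points[Math.floor(Math.random() * points.length)]);
  }
  return out;
}

function kMeans(points, k, iterations = 20) {
  // initialize centroids from random points
  let centroids = sampleFrom(points, k).map(p => p.slice());
  for (let it = 0; it < iterations; it++) {
    const sums = centroids.map(() => [0, 0, 0]); // [sumX, sumY, count]
    for (const [x, y] of points) {
      // assign each point to its nearest centroid
      let best = 0, bestDist = Infinity;
      centroids.forEach(([cx, cy], i) => {
        const d = (x - cx) ** 2 + (y - cy) ** 2;
        if (d < bestDist) { bestDist = d; best = i; }
      });
      sums[best][0] += x; sums[best][1] += y; sums[best][2] += 1;
    }
    // recompute each centroid as the mean of its assigned points
    centroids = sums.map(([sx, sy, n], i) => (n ? [sx / n, sy / n] : centroids[i]));
  }
  return centroids;
}

// Pretend this array is the "big" data set (in reality it would not fit in memory).
const allPoints = Array.from({ length: 100000 }, () => [Math.random() * 10, Math.random() * 10]);
// Cluster a 1% sample on a single node instead of the full set.
console.log(kMeans(sampleFrom(allPoints, 1000), 5));

The centroids computed from the 1% sample will be statistically close to those from the full set, which is exactly the point made above.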
Since this applies to quite a lot of problems, actual data mining on Hadoop doesn't seem to kick off. Everybody tries to do it, and a lot of companies sell this stuff. But it doesn't really work much better than the non-big version. But as long as customers want to buy this, companies will sell this functionality. And as long as it gets you a grant, researchers will write papers on this. Whether it works or not. That's life.
There are a few cases where these things work. Google search is an example, and Cern. But also image recognition (but not using Hadoop, clusters of GPUs seem to be the way to go there) has recently benefited from an increase in data size. But in any of these cases, you have rather clean data. Google indexes everything; Cern discards any non-interesting data, and only analyzes interesting measurements - there are no spammers feeding their spam into Cern... and in image analysis, you train on preselected relevant images, not on say webcams or random images from the internet (and if so, you treat them as random images, not as representative data).
What is the difference between big data and Hadoop?
A: The difference between big data and the open source software program Hadoop is a distinct and fundamental one. The former is an asset, often a complex and ambiguous one, while the latter is a program that accomplishes a set of goals and objectives for dealing with that asset.
Big data is simply the large sets of data that businesses and other parties put together to serve specific goals and operations. Big data can include many different kinds of data in many different kinds of formats. For example, businesses might put a lot of work into collecting thousands of pieces of data on purchases in currency formats, on customer identifiers like name or Social Security number, or on product information in the form of model numbers, sales numbers or inventory numbers. All of this, or any other large mass of information, can be called big data. As a rule, it’s raw and unsorted until it is put through various kinds of tools and handlers.
Hadoop is one of the tools designed to handle big data. Hadoop and other software products work to interpret or parse the results of big data searches through specific algorithms and methods. Hadoop is an open-source program under the Apache license that is maintained by a global community of users. It includes various main components, including a MapReduce set of functions and the Hadoop Distributed File System (HDFS).
The idea behind MapReduce is that Hadoop can first map a large data set, and then perform a reduction on that content for specific results. A reduce function can be thought of as a kind of filter for raw data. The HDFS system then acts to distribute data across a network or migrate it as necessary.
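As a rough illustration of that map-then-reduce flow, here is the classic word-count example written as plain JavaScript standing in for actual Hadoop code:

const lines = ["big data is big", "hadoop processes big data"];

// "map": emit a (word, 1) pair for every word in every input record
const pairs = lines.flatMap(line => line.split(/\s+/).map(word => [word, 1]));

// "reduce": sum the pairs per key, shrinking the mapped output to a small result
const counts = pairs.reduce((acc, [word, n]) => {
  acc[word] = (acc[word] || 0) + n;
  return acc;
}, {});

console.log(counts); // { big: 3, data: 2, is: 1, hadoop: 1, processes: 1 }

In Hadoop, the map step would run in parallel on the nodes holding each block of input, and the framework would shuffle the pairs by key before the reduce step combines them.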
Database administrators, developers and others can use the various features of Hadoop to deal with big data in any number of ways. For example, Hadoop can be used to pursue data strategies like clustering and targeting with non-uniform data, or data that doesn't fit neatly into a traditional table or respond well to simple queries.
See the article posted at http://www.shareideaonline.com/cs/what-is-the-difference-between-big-data-and-hadoop/
Thanks
Ankush
This answer is really intended to add some specificity to the excellent answer from Anony-Mousse.
There's a lot of debate over exactly what Big Data is. Anony-Mousse called out a lot of the issues here around the overuse of terms like analytics, big data, and data mining, but there are a few things I want to provide more detail on.
Big Data
For practical purposes, the best definition I've heard of big data is data that is inconvenient to work with or does not fit in a traditional relational database. This could be 1 PB of data that cannot be worked with, or even just 1 GB of data that has 5,000 columns.
This is a loose and flexible definition. There are always going to be setups or data-management tools which can work around it, but this is where tools like Hadoop, MongoDB, and others can be used more efficiently than prior technology.
What can we do with data that is this inconvenient/large/difficult to work with? It's difficult to simply look at a spreadsheet and to find meaning here, so we often use data mining and machine learning.
Data Mining
This was called out lightly above; my goal here is to be more specific and hopefully to provide more context. Data mining generally applies to somewhat supervised analytic or statistical methods for analysis of data. These may fit into regression, classification, clustering, or collaborative filtering. There's a lot of overlap with machine learning; however, this is still generally driven by a user rather than by unsupervised or automated execution, which characterizes machine learning fairly well.
Machine Learning
Often, machine learning and data mining are used interchangeably. Machine learning encompasses a lot of the same areas as data mining but also includes AI, computer vision, and other unsupervised tasks. The primary difference, and this is definitely a simplification, is that user input is not only unnecessary but generally unwanted. The goal is for these algorithms or systems to self-optimize and to improve, rather than an iterative cycle of development.
Big Data is a term which covers a collection of frameworks and tools that can do miracles with very large data sets, including data mining.
Hadoop is a framework which splits very large data sets into blocks (64 MB by default), stores them in HDFS (the Hadoop Distributed File System), and then, when its execution logic (MapReduce) is given code to process the data stored in HDFS, takes splits based on blocks (splits can be configured) and performs the extraction and computation via Mapper and Reducer processes. This way you can do ETL processing, data mining, data computation, etc.
I would like to conclude that Big Data is a terminology which lets you play with very large data sets. Hadoop is a framework which does parallel processing very well with its components and services. That way you can do data mining too.
Big Data is the term people use to say how storage is cheap and easy these days and how data is available to be analyzed.
Data Mining is the process of trying to extract useful information from data.
Usually, data mining is related to big data for two reasons:
when you have lots of data, patterns are not so evident, so someone cannot just inspect the data and say "hah"; he/she needs tools for that.
many times, lots of data can improve the statistical meaningfulness of your analysis because your sample is bigger.
Can we say Hadoop is doing data mining in parallel? What is Hadoop? Its site says:
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
So the "parallel" part of your statement is true. The "data mining" part of it is not necessarily. You can just use hadoop to summarize tons of data and this is not necessarily data mining, for example. But for most cases, you can bet people are trying to extract useful info from big data using hadoop, so this is kind of a yes.
I would say that Big Data is a modernized framework for addressing new business needs.
As many people might know, Big Data is all about the 3 V's: volume, variety and velocity. It is about leveraging a variety of data (structured and unstructured), using clustering techniques to address the volume issue, and getting results in less time, i.e. velocity.
Data mining, on the other hand, follows the ETL principle, i.e. finding useful information from large datasets using modelling techniques. There are many BI tools available in the market to achieve this.

Normalize or Denormalize in high traffic websites

What are the best practices for database design and normalization for high-traffic websites like Stack Overflow?
Should one use a normalized database for record keeping, a denormalized approach, or a combination of both?
Is it sensible to design a normalized database as the main database for record keeping to reduce redundancy and at the same time maintain another denormalized form of the database for fast searching?
or
Should the main database be denormalized but with normalized views at the application level for fast database operations?
or some other approach?
The performance hit of joining is frequently overestimated. Database products like Oracle are built to join very efficiently. Joins are often regarded as performing badly when the real culprit is a poor data model or a poor indexing strategy. People also forget that denormalised databases perform very badly when it comes to inserting or updating data.
The key thing to bear in mind is the type of application you're building. Most of the famous websites are not like regular enterprise applications. That's why Google, Facebook, etc don't use relational databases. There's been a lot of discussion of this topic recently, which I have blogged about.
So if you're building a website which is primarily about delivering shedloads of semi-structured content you probably don't want to be using a relational database, denormalised or otherwise. But if you're building a highly transactional website (such as an online bank) you need a design which guarantees data security and integrity, and does so well. That means a relational database in at least third normal form.
Denormalizing the db to reduce the number of joins needed for intense queries is one of many different ways of scaling. Having to do fewer joins means less heavy lifting by the db, and disk is cheap.
That said, for ridiculous amounts of traffic, good relational DB performance can be hard to achieve. That is why many bigger sites use key-value stores (e.g. memcached) and other caching mechanisms.
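A minimal sketch of the usual cache-aside pattern (a plain in-memory Map stands in for memcached here, and loadFromDb is a hypothetical placeholder for the real, join-heavy query):

const cache = new Map();          // stand-in for memcached or similar
const TTL_MS = 60 * 1000;         // assumed 60-second freshness window

async function getCached(key, loadFromDb) {
  const hit = cache.get(key);
  if (hit && hit.expires > Date.now()) {
    return hit.value;                                   // cache hit: no database work
  }
  const value = await loadFromDb(key);                  // cache miss: run the expensive query once
  cache.set(key, { value, expires: Date.now() + TTL_MS });
  return value;
}

// Usage (queryUserProfile is hypothetical): getCached("user:42", id => queryUserProfile(id))

The relational schema can then stay reasonably normalized, because most reads never reach the database at all.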
The Art of Capacity Planning is pretty good.
You can listen to a discussion on this very topic by the creators of Stack Overflow on their podcast at:
http://itc.conversationsnetwork.org/shows/detail3993.html
First: define for yourself what high-traffic means:
50,000 page views per day?
500,000 page views per day?
5,000,000 page views per day?
more?
Then calculate this down to probable peak page views per minute and per second.
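A rough back-of-the-envelope sketch for the middle figure above, assuming (purely for illustration) that traffic is spread over about 16 active hours with a 3x peak factor:

const pageViewsPerDay = 500000;
const activeHours = 16;
const peakFactor = 3;

const avgPerSecond = pageViewsPerDay / (activeHours * 3600);
console.log(avgPerSecond.toFixed(1));                 // ~8.7 page views per second on average
console.log((avgPerSecond * 60).toFixed(0));          // ~521 per minute
console.log((avgPerSecond * peakFactor).toFixed(1));  // ~26.0 per second at peak

Numbers like these help you decide whether you need caching and scale-out at all, or just a well-tuned single database.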
After that think about the data you want to query per page-view. Is the data cacheable? How dynamic is the data, how big is the data?
Analyze your individual requirements, program some code, do some load-testing, optimize. In most cases, before you need to scale out the database servers you need to scale out the web-servers.
A relational database can, if fully optimized, be amazingly fast when joining tables!
A relational database could be hit only seldom when used as a back-end, to populate a cache or fill some denormalized data tables. I would not make denormalization the default approach.
(You mentioned search, look into e.g. lucene or something similar, if you need full-text search.)
The best best-practice answer is definitely: It depends ;-)
For a project I'm working on, we've gone for the denormalized-table route as we expect our major tables to have a high ratio of writes to reads (instead of all users hitting the same tables, we've denormalized them and set each "user set" to use a particular shard). You may find it worth reading http://highscalability.com/ for examples of how the "big sites" cope with the volume - Stack Overflow was recently featured.
Neither matters if you aren't caching properly.
