Recommendation for a large-scale data warehousing system

Recommendation for a large-scale data warehousing system - amazon-ec2

I have a large amount of data I need to store, and be able to generate reports on - each one representing an event on a website (we're talking over 50 per second, so clearly older data will need to be aggregated).
I'm evaluating approaches to implementing this, obviously it needs to be reliable, and should be as easy to scale as possible. It should also be possible to generate reports from the data in a flexible and efficient way.
I'm hoping that some SOers have experience of such software and can make a recommendation, and/or point out the pitfalls.
Ideally I'd like to deploy this on EC2.

Wow. You are opening up a huge topic.
A few things right off the top of my head...
think carefully about your schema for inserts in the transactional part and reads in the reporting part, you may be best off keeping them separate if you have really large data volumes
look carefully at the latency that you can tolerate between real-time reporting on your transactions and aggregated reporting on your historical data. Maybe you should have a process which runs periodically and aggregates your transactions.
look carefully at any requirement which sees you reporting across your transactional and aggregated data, either in the same report or as a drill-down from one to the other
prototype with some meaningful queries and some realistic data volumes
get yourself a real production quality, enterprise ready database, i.e. Oracle / MSSQL
think about using someone else's code/product for the reporting e.g. Crystal/BO / Cognos
as I say, huge topic. As I think of more I'll continue adding to my list.
HTH and good luck

#Simon made a lot of excellent points, I'll just add a few and re-iterate/emphasize some others:
Use the right datatype for the Timestamps - make sure the DBMS has the appropriate precision.
Consider queueing for the capture of events, allowing for multiple threads/processes to handle the actual storage of the events.
Separate the schemas for your transactional and data warehouse
Seriously consider a periodic ETL from transactional db to the data warehouse.
Remember that you probably won't have 50 transactions/second 24x7x365 - peak transactions vs. average transactions
Investigate partitioning tables in the DBMS. Oracle and MSSQL will both partition on a value (like date/time).
Have an archiving/data retention policy from the outset. Too many projects just start recording data with no plans in place to remove/archive it.

Im suprised none of the answers here cover Hadoop and HDFS - I would suggest that is because SO is a programmers qa and your question is in fact a data science question.
If youre dealing with a large number of queries and large processing time, you would use HDFS (a distributed storage format on EC) to store your data and run batch queries (I.e. analytics) on commodity hardware.
You would then provision as many EC2 instances as needed (hundreds or thousands depending on how big your data crunching requirements are) and run map reduce queires against.your data to produce reports.

Wow.. This is a huge topic.
Let me begin with databases. First get something good if you are going to have crazy amounts to data. I like Oracle and Teradata.
Second, there is a definitive difference between recording transactional data and reporting/analytics. Put your transactional data in one area and then roll it up on a regular schedule into a reporting area (schema).
I believe you can approach this two ways
Throw money at the problem: Buy best in class software (databases, reporting software) and hire a few slick tech people to help
Take the homegrown approach: Build only what you need right now and grow the whole thing organically. Start with a simple database and build a web reporting framework. There are a lot of descent open-source tools and inexpensive agencies that do this work.
As far as the EC2 approach.. I'm not sure how this would fit into a data storage strategy. The processing is limited which is where EC2 is strong. Your primary goal is effecient storage and retreival.

Related

Creating real time datawarehouse

I am doing a personal project that consists of creating the full architecture of a data warehouse (DWH). In this case as an ETL and BI analysis tool I decided to use Pentaho; it has a lot of functionality from allowing easy dashboard creation, to full data mining processes and OLAP cubes.
I have read that a data warehouse must be a relational database, and understand this. What I don't understand is how to achieve a near real time, or fully real time DWH. I have read about push and pull strategies but my conclusions are the following:
The choice of DBMS is not important to create real time DWH. I mean that is possible with MySQL, SQL Server, Oracle or any other. As I am doing it as a personal project I choose MySQL.
The key factor is the frequency of the jobs scheduling, and this is task of the scheduler. Is this assumption correct? I mean, the key to create a real time DWH is to establish jobs every second for every ETL process?
If I am wrong can you provide me some help to understand this? And then, which is the way to create a real time DWH? Is the any open source scheduler that allows that? And any not open source scheduler which allows that?
I am very confused because some references say that this is impossible, others that is possible.

Definition
Very interesting question. First of all, it should be defined how "real-time" realtime should be. Realtime really has a very low latency for incoming data but requires good architecture in the sending systems, maybe a event bus or messaging queue and good infrastructure on the receiving end. This usually involves some kind of listener and pushing from the deliviering systems.
Near-realtime would be the next "lower" level. If we say near-realtime would be about 5 minutes delay max, your approach could work as well. So for example here you could pull every minute or so the data. But keep in mind that you need some kind of high-performance check if new data is available and which to get. If this check and the pull would take longer than a minute it would become harder to keep up with the data. Really depends on the volume.
Realtime
As I said before, realtime analytics require at best a messaging queue or a service bus some jobs of yours could connect to and "listen" for new data. If a new data package is pushed into the pipeline, the size of it will probably be very small and it can be processed very fast.
If there is no infrastructure for listeners, you need to go near-realtime.
Near-realtime
This is the part where you have to develop more. You have to make sure to get realtively small data packages which will usually be some kind of delta. This could be done with triggers if you have access to the database. Otherwise you have to pull every once in a while whereas your "once" will probably be very frequent.
This could be done on Linux for example with a simple conjob or on Windows with event planning. Just keep in mind that your loading and processing time shouldn't exceed the time window you have got until the next job is being started.
Database
In the end, when you defined what you want to achieve and have a general idea how to implement delta loading or listeners, you are right - you could take a relational database. If you are interested in performance and are modelling this part as Star Schema, you also could look into Column Based Engines or Column Based Databases like Apache Cassandra.
Scheduling
Also for job scheduling you could start with Linux or Windows standard planning tools. If you code in Java you could use later something like quartz. But this would only be the case for near-realtime. Realtime requires a different architecture as I explained above.

Use Vertica Database for OLTP data?

Can Vertica Database be used for OLTP data?
And if so what are the pros and cons on doing this?
Looking for a Vertica vs Oracle fight :)Since Oracle license is so costly, would Vertica do it job for a better price ?
thx all

Using Vertica as a transactional database is a bad idea. It's designed to be a data warehousing tool. Essentially, it reads and writes data in an optimized fashion. Lots of transactions? That's not what it is designed to do.
I would recommend that you look into VoltDB. Michael Stonebreaker who is the force behind Vertica founded that company as well. His basic philosophy is that Oracle, SQL Server, et al do not do well for high performance since they are designed to do everything. The future is having databases designed for specific tasks.
So he had some concepts for a data warehousing which became Vertica. For transactional databases, there's VoltDB. Not owned by HP, for the record.
For the record, I haven't used VoltDB. From what I know, it isn't as mature as Vertica is as a solution but it looks like it has a ton of promise.

HP Vertica is a column store database. The nature of the way that data is organised within a column store does not lend itself to rapid writes.
HP Vertica gets around this by having a WOS (Write Optimised Store) and ROS (Read Optimised Store which is file based).
Data is moved out of the WOS into the ROS fairly rapidly and the ROS itself has a "merge up" process that takes small ROS files and merges them together to form larger and therefore more easily scanned files.
If you tried to use Vertica for OLTP then what would happen would be that you'd get loads of ROS containers and possibly hit the default limit of 1024 ROS containers very quickly.
If you fronted the store with some form a queuing mechanism to pass through records in larger batches then this would result in fewer and larger ROS files. It would work but if you wanted to take your OLTP system to be reading very close to its writing activity it would not fit the use case.
The WOS/ROS mechanism is a neat work around for the fundamental performance penalty of writes in a column store DB but fundamentally Vertica is not an OLTP DB but rather a data mart technology that can ingest data in near real time

I think there are different ways to read into this question.
Can you use Vertica as an OLTP database?
First I'll define this question a bit. An OLTP database means the database itself is responsible for the transaction processing, not simply receiving somewhat normalized data.
My answer here is absolutely not, unless perhaps it is a single user database. There is practically no RI, no RI locking, table locks on DELETE/UPDATE, and you're likely to accumulate a delete vector in normal OLTP type usage.
You can work around some of these with some extensive middleware programming (distributed locks, heavy avoidance of DELETE/UPDATE, etc). But why? There are tons of options out there that are not Oracle, don't carry a huge price tag but give you everything you need for OLTP.
Can you use Vertica to ingest and query OLTP data?
Yes, definitely. Best to use Vertica towards its strengths, though. Queries in Vertica tend to have a fair amount of overhead, and you can plow through large amounts of data with ease, even normalized. I would not be using Vertica to primary run point queries, grabbing a few rows here and there. It isn't that you can't, but you can't with the same concurrency as other databases that are meant for this purpose.
TL;DR Use the right tool for the right job. I really love using Vertica, but just because I like to swing a hammer doesn't mean that every problem is a nail.

This question is a little old now but i'll share my experience.
I would not suggest vertica as OLTP unless you very carefully consider your workload.
As mentioned in other answers, Vertica has 2 types of storage. ROS is the Read Optimized Storage and WOS is the Write Optimized Storage. WOS is purely in memory so it performs better for inserts but queries slower as all the small updates need to be queried and unioned. Vertica can handle small loads in theory but in practice it didn't work out very well for us performance wise. Also there are drawbacks to WOS namely being that when the database fails WOS is not necessarily preserved when it rolls back to last good epoch. (ROS isn't either but in practice you lose a lot less from ROS).
ROS is a lot more reliable and gives better read performance but you will never be able to handle more than a certain number of queries without a careful design. Although vertica is horizontally scalable, in practice large tables get segmented across all nodes and therefore queries must run on all nodes. So adding more nodes doesn't mean handling more concurrent queries it just means less work per query. If your tables are small enough to be unsegmented then this might not be an issue for you.
Also worth noting is the OLTP typically implies lots concurrent transactions so you'll need to plan resource pools very carefully. By default vertica has a planned concurrency for the general resource pool of the minimum of number of cores per server or RAM/2GB. Essentially what this value does is determine the default memory allocation PER NODE for a segmented query. Therefore by default vertica will not let you run more queries than cores. You can adjust this value but once you hit a cap on memory theres no much you can do because the memory is allocated per node so adding more nodes doesn't even help. If you hit any errors at all for resource pool memory allocations that is the first config your should look at.
Additionally, Vertica is bad with deletes and updates (which resolve to a delete and an insert in the background) so if these are a regular part of your workload then Vertica is probably a bad choice. Personally we use MySQL for our dimension tables that require deletes/updates and then sync that data periodically into vertica to use for joins.
Personally I use Vertica as an OLTP-ish realtime-ish database. We batch our loads into 5 minute intervals which makes vertica happy in terms of how many/large the inserts are. These batches are inserted using COPY DIRECT so that they avoid WOS entirely (only do this if they are large batches as this forces ROS container creation and can be bad if you do it too often). As many projections as we can have are unsegmented to allow better scale out since this makes queries hit only 1 node and allocate memory on only 1 node. It has worked well for us so far and we load about 5 billion rows a day with realtime querying from our UI.

Up_one - considering the telecom use-case - are you doing CDR or something else?
To answer your original question yes Vertica may be a great fit but it depends on how you are loading the data, how you are doing updates, what your data size is and what your SLA is. I am really familiar in this space because I implemented Vertica at a telecom that I worked for at the time.

Hadoop as document store database

We have a large document store currently running at 3TB in space and it increments by 1 TB every six months. They are currently stored in a windows file system which has at times caused problems in terms of access and retrieval. We are looking to exploit a Hadoop based document store database. Is it a good idea to go ahead with Hadoop? Anyone has any exposure to the same? What can be the challenges, technology roadblocks in achieving the same?

Hadoop is more for batch processing that high data access. You should have a look at some NoSQL systems, like document oriented databases. Hard to answer without knowing what your data is like.
The number one rule to NoSQL design is to define your query scenarios first. Once you really understand how you want to query the data then you can look into the various NoSQL solutions out there. The default unit of distribution is key. Therefore you need to remember that you need to be able to split your data between your node machines effectively otherwise you will end up with a horizontally scalable system with all the work still being done on one node (albeit better queries depending on the case).
You also need to think back to CAP theorem, most NoSQL databases are eventually consistent (CP or AP) while traditional Relational DBMS are CA. This will impact the way you handle data and creation of certain things, for example key generation can be come trickery. Obviously files in a folder are a bit different.
Also remember than in some systems such as HBase there is no indexing concept (I'm gussing you have file indexing setup on this windows FS document store). All your indexes will need to be built by your application logic and any updates and deletes will need to be managed as such. With Mongo you can actually create indexes on fields and query them relatively quickly, there is also the possibility to integrate Solr with Mongo. You don’t just need to query by ID in Mongo like you do in HBase which is a column family (aka Google BigTable style database) where you essentially have nested key-value pairs.
So once again it comes to your data, what you want to store, how you plan to store it, and most importantly how you want to access it. The Lily project looks very promising. THe work I am involved with we take a large amount of data from the web and we store it, analyse it, strip it down, parse it, analyse it, stream it, update it etc etc. We dont just use one system but many which are best suited to the job at hand. For this process we use different systems at different stages as it gives us fast access where we need it, provides the ability to stream and analyse data in real-time and importantly, keep track of everything as we go (as data loss in a prod system is a big deal) . I am using Hadoop, HBase, Hive, MongoDB, Solr, MySQL and even good old text files. Remember that to productionize a system using these technogies is a bit harder than installing Oracle on a server, some releases are not as stable and you really need to do your testing first. At the end of the day it really depends on the level of business resistance and the mission-critical nature of your system.
Another path that no one thus far has mentioned is NewSQL - i.e. Horizontally scalable RDBMSs... There are a few out there like MySQL cluster (i think) and VoltDB which may suit your cause.But again depending on your data (are the files word docs or text docs with info about products, invoices or instruments or something)...
Again it comes to understanding your data and the access patterns, NoSQL systems are also Non-Rel i.e. non-relational and are there for better suit to non-relational data sets. If your data is inherently relational and you need some SQL query features that really need to do things like Cartesian products (aka joins) then you may well be better of sticking with Oracle and investing some time in indexing, sharding and performance tuning.
My advice would be to actually play around with a few different systems. Look at;
MongoDB - Document - CP
CouchDB - Document - AP
Cassandra - Column Family - Available & Partition Tolerant (AP)
VoltDB - A really good looking product, a relation database that is distributed and might work for your case (may be an easier move). They also seem to provide enterprise support which may be more suited for a prod env (i.e. give business users a sense of security).
Any way thats my 2c. Playing around with the systems is really the only way your going to find out what really works for your case.

HDFS does not sound to be right solution. It is optimized for massive parralel processing of the data and not to be general purpose file system.
Specifically it has following limitations making it probabbly bad choice:
a) It is sensitive to the number of files. Practical limit should be about dozens of millions of files.
b) The files are read only, and can only be appended, but not edited. It is fine for analytical data processing but might not suite your need.
c) It has single point of failure - namenode. So its reliability is limited.
If you need system with comparable scalability, but not sensitive to number of files I would suggest OpenStack's Swift. It also does not have SPOF.

My suggestion is you can buy a NAS storage. May be EMS isilon kind of product you can consider.
Hadoop HDFS is not for file storage. It is storage to processing the data (for reports, analytics..)
NAS is for file sharing
SAN is more for a database
http://www.slideshare.net/jabramo/emc-sanoverviewpresentation
Declaration: I am not a EMC person, so you can consider any product. I just used EMC for reference.

Oracle (RAC) vs NoSQL

I am curious if anybody did benchmarks for accessing of data in NoSQL databases vs Oracle (particularly I am talking about Oracle RAC)?
The project requires to work with at least 10mil+ of records, search among them (but not necessary have to be real time), the read is very important for speed, and it's also very important to guarantee HA and reliability (can't lose records!!!)
I can see for myself how say Cassandra/MongoDB might be better fit (because key value storage will provide faster reads than SQL when you go over 10mil records), but I find difficult to articulate all of them nicely. Any links? Suggestions? Bullet points?
Thanks!

10 million records. Assume 250 bytes per record. That is about 2.5 Gb of data, which is well within the capacity of a basic desktop / laptop PC. The data volumes are insignificant (unless each record is sized in Mb, such as picture or audio).
What you do need to talk about is transaction volumes (separated into read and write) and what you consider HA. Read-only HA is easy relative to "Read-write HA". It can be trivial to replicate a read-only data set off to multiple servers at different geographic locations and distribute a query workload on them.
It's much harder to scale out an update heavy workload, which is why you often hear about systems going into meltdown when tickets for a big concert are released. Quite simply there's a fixed number of seats and you can't have ten duplicated systems each selling what they think is available. There has to be a single source of truth, which means a bottleneck (and potentially a single point of failure).
On the HA aspect, RAC is a shared storage technology which generally means your RAC nodes are in close proximity. That can make them vulnerable to localized events such as a building fire or telecoms breakdown. Data Guard is the Oracle technology that relates to off-site replication and failover.

Mostly when you come to comparison of NoSQL vs SQL, you have to understand a very important difference between them. Data in NoSQL may be inconsistent in cost order to achieve HA.
What do I mean by inconsistent? It depends, but usually around 3-5 seconds to propagate the data around nodes. NoSQL database provide mechanism to manage and eliminate that, but if you want all your data be consistent in real time, then you simply use classic SQL, like Oracle RAC.
Coming back to speed comparison: it's simply incomparable which one is faster, because it relays on factors like network infrastructure, computing power and database model etc. But important thing is that at some point you may reach the moment that SQL is economically inefficient to maintain and you have to switch to NoSQL.

Dealing with Gigabytes of Data

I am going to start on with a new project. I need to deal with hundred gigs of data in a .NET application. It is very early stage now to give much detail about this project. Some overview is follows:
Lots of writes and Lots of reads on same tables, very realtime
Scaling is very important as the client insists expansion of database servers very frequently, thus, the application servers as well
Foreseeing, lots and lots of usage in terms of aggregate queries could be implemented
Each row of data may contains lots of attributes to deal with
I am suggesting/having following as a solution:
Use distributed hash table sort of persistence (not S3 but an inhouse one)
Use Hadoop/Hive likes (any replacement in .NET?) for any analytical process across the nodes
Impelement GUI in ASP.NET/Silverlight (with lots of ajaxification,wherever required)
What do you guys think? Am i making any sense here?

Are your goals performance, maintainability, improving the odds of success, being cutting edge?
Don't give up on relational databases too early. With a $100 external harddrive and sample data generator (RedGate's is good), you can simulate that kind of workload quite easily.
Simulating that workload on a non-relational and cloud database and you might be writing your own tooling.

"Foreseeing, lots and lots of usage in terms of aggregate queries could be implemented"
This is the hallmark of a data warehouse.
Here's the trick with DW processing.
Data is FLAT. Facts and Dimensions. Minimal structure, since it's mostly loaded and not updated.
To do aggregation, every query must be a simple SELECT SUM() or COUNT() FROM fact JOIN dimension GROUP BY dimension attribute. If you do this properly so that every query has this form, performance can be very, very good.
Data can be stored in flat files until you want to aggregate. You then load the data people actually intend to use and create a "datamart" from the master set of data.
Nothing is faster than simple flat files. You don't need any complexity to handle terabytes of flat files that are (as needed) loaded into RDBMS datamarts for aggregation and reporting.
Simple bulk loads of simple dimension and fact tables can be VERY fast using the RDBMS's tools.
You can trivially pre-assign all PK's and FK's using ultra-high-speed flat file processing. This makes the bulk loads all the simpler.
Get Ralph Kimball's Data Warehouse Toolkit books.

Modern databases work very well with gigabytes. It's when you get into terabytes and petabytes that RDBMSes tend to break down. If you are foreseeing that kind of load, something like HBase or Cassandra may be what the doctor ordered. If not, spend some quality time tuning your database, inserting caching layers (memached), etc.

"lots of reads and writes on the same tables, very realtime" - Is integrity important? Are some of those writes transactional? If so, stick with RDBMS.
Scaling can be tricky, but it doesn't mean you have to go with cloud computing stuff. Replication in DBMS will usually do the trick, along with web application clusters, load balancers, etc.

Give the RDBMS the responsibility to keep the integrity. And treat this project as if it were a data warehouse.
Keep everything clean, you dont need to go using a lot of third parties tools: use the RDBMS tools instead.
I mean, use all tools that the RDBMS has, and write an GUI that extract all data from the Db using well written stored procedures of a well designed physical data model (index, partitions, etc).
Teradata can handle a lot of data and is scalable.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio