Apache Kylin fault tolerance

Apache Kylin looks like a great tool that will fill the needs of a lot of data scientists. It's also a very complex system. We are developing an in-house solution with exactly the same goal in mind: a multidimensional OLAP cube with low query latency.
Among the many issues, the one I'm most concerned about right now is fault tolerance.
With large volumes of incoming transactional data, the cube must be updated incrementally, and some of the cuboids are updated over a long period of time, such as those with a time dimension at the scale of a year. Over such a long period, some piece of the complex system is guaranteed to fail, so how does the system ensure that all the raw transactional records are aggregated into the cuboids exactly once, no more, no less? Even if each of the pieces has its own fault tolerance mechanism, that doesn't mean they will play together automatically.
For simplicity, we can assume all the input data are saved in HDFS by another process, and can be "played back" in any way you want to recover from any interruption, voluntary or forced. What are Kylin's fault tolerance considerations, or is it not really an issue?

There are data faults and system faults.
Data fault tolerance: Kylin partitions a cube into segments and allows rebuilding an individual segment without impacting the whole cube. For example, assume a new segment is built on a daily basis and gets merged into a weekly segment at the weekend; weekly segments merge into a monthly segment, and so on. When there is a data error (or any other change) within the current week, you only need to rebuild one day's segment. Data changes further back will require rebuilding a weekly or monthly segment.
The segment strategy is fully customizable, so you can balance data error tolerance against query performance. More segments mean better tolerance of data changes, but also more scans to execute for each query. Kylin provides a RESTful API; an external scheduling system can invoke the API to trigger segment builds and merges.
A cube is still online and can serve queries while some of its segments are being rebuilt.
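For illustration, here is a minimal sketch of how an external scheduler might trigger a segment build through Kylin's REST API from plain Java. The endpoint path, payload fields, credentials, and cube name below are assumptions for the example; check them against the REST API documentation of your Kylin version.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class KylinSegmentBuildTrigger {

    public static void main(String[] args) throws Exception {
        // Assumed endpoint layout: PUT /kylin/api/cubes/{cubeName}/rebuild
        URL url = new URL("http://kylin-host:7070/kylin/api/cubes/my_cube/rebuild");

        // Build window expressed as epoch milliseconds (e.g. one day's segment).
        String payload = "{"
                + "\"startTime\": 1400000000000,"
                + "\"endTime\": 1400086400000,"
                + "\"buildType\": \"BUILD\""
                + "}";

        // Placeholder credentials for basic auth.
        String auth = Base64.getEncoder()
                .encodeToString("ADMIN:KYLIN".getBytes(StandardCharsets.UTF_8));

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setRequestProperty("Authorization", "Basic " + auth);
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);

        try (OutputStream out = conn.getOutputStream()) {
            out.write(payload.getBytes(StandardCharsets.UTF_8));
        }

        // A 200 response means the build job was submitted; the scheduler can
        // then poll the jobs API to watch progress and retry failed steps.
        System.out.println("Kylin responded with HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}
```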
System fault tolerance: Kylin relies on Hadoop and HBase for most system redundancy and fault tolerance. In addition, every build step in Kylin is idempotent, meaning you can safely retry a failed step without any side effects. This ensures final correctness, no matter how many failures and retries the build process has been through.
(I'm also an Apache Kylin co-creator and committer. :-)

Note: I'm an Apache Kylin co-creator and committer.
The fault tolerance point is a really good one, and it is something we have actually been asked about in cases involving extremely large datasets. Recalculating everything from the beginning would require huge computing resources, network traffic, and time.
But from a product perspective, the question is: which matters more, precise results or resources? For transaction data, I believe the exact number is more important, but for behavioral data an approximation should be fine; for example, the distinct count value in Kylin today is an approximate result. It depends on what kind of case you want Kylin to serve for your business needs.
We will put this idea into our backlog and update the Kylin dev mailing list once we have a clearer plan for it.
Thanks.

Related

Frequent Updates on Apache Ignite

I hope someone experienced with Apache Ignite can help guide my team towards the answer regarding a new setup with Apache Ignite.
Overall Setup
Data is continuously generated from many distributed sensors and streamed into our database. Each sensor may deliver many updates every second, but generally generates <10 updates/sec.
The daily data volume is approx. 50 million records, per site.
Data Description
Each record consists of the following values
Sensor ID
Point ID
Timestamp
Proximity
where Sensor ID is our ID of the sensor, Point ID is an ID of some point on the site, Timestamp is when the measurement was taken, and Proximity is a proximity measurement from the sensor to the point.
Each second there is approx. 1000 such new records. A record is never updated.
Query Workload
Queries are fairly complex with significant (and dynamic) look-back in time. A query may require data from several sensors in one site, but the required sensors are determined dynamically. Most continuous queries only require data from the last few hours, but frequently it is necessary to query over many days.
Generally, we therefore have a write-once query-many scenario.
Initial Strategy
If we load the data into primitive integer arrays in, e.g., Java, the space consumption for a week approaches 5 GB. Because that is "peanuts" on today's platforms, we intend to load all data onto all nodes in the Ignite cluster/distributed cache. In other words, use a replicated cache.
However, the continuous updates keep puzzling me. If I update the entire cache, I imagine quite substantial amounts of data need to be transferred across the network every second.
Creating chunks for, say, each minute/hour is not necessarily going to work (well) either as each sensor can be temporarily offline, which will make it deliver stale data at some later point in time.
My question is therefore how to efficiently handle this stream of updates, while maintaining a consistent view of the data for the last 7-10 days.
My current, local implementation chunks the data into 1-hour chunks. When a new record for a given chunk arrives, the chunk is replaced with an updated chunk. This works well on a single machine but is likely too expensive in terms of network overhead in a cluster. I do not have an Ignite implementation yet, so I have not been able to test this.
Ideally, each node in the Ignite cluster would maintain its own copy of all data within the last X days and apply the small update workload continuously.
So my question is, how would fellow Igniters approach this problem?
It sounds like you want to scale the load across multiple servers, but that's not possible with replicated caches, because each update will always update all nodes, and the more nodes you have, the more network traffic you will get. I think you should use partitioned caches instead and add nodes until the system is capable of handling the load.
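As a rough sketch of that partitioned setup, assuming a simple long record id as key and a serialized reading as value (the cache and value layout are illustrative):

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.configuration.CacheConfiguration;

public class SensorCacheSetup {

    public static void main(String[] args) {
        // Start (or join) an Ignite node with the default configuration.
        try (Ignite ignite = Ignition.start()) {

            // Partitioned cache: each record lives on one primary node plus a
            // configurable number of backups, so a single insert only touches
            // a couple of nodes instead of the whole cluster.
            CacheConfiguration<Long, String> cfg =
                    new CacheConfiguration<>("sensorReadings");
            cfg.setCacheMode(CacheMode.PARTITIONED);
            cfg.setBackups(1); // tolerate the loss of one node

            IgniteCache<Long, String> cache = ignite.getOrCreateCache(cfg);

            // Illustrative write: key = record id, value = serialized reading.
            cache.put(1L, "sensor=42,point=7,ts=1700000000,proximity=3.2");
        }
    }
}
```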

Creating real time datawarehouse

I am doing a personal project that consists of creating the full architecture of a data warehouse (DWH). In this case, as the ETL and BI analysis tool I decided to use Pentaho; it has a lot of functionality, from easy dashboard creation to full data mining processes and OLAP cubes.
I have read that a data warehouse must be a relational database, and I understand this. What I don't understand is how to achieve a near-real-time, or fully real-time, DWH. I have read about push and pull strategies, but my conclusions are the following:
The choice of DBMS is not important for creating a real-time DWH. I mean that it is possible with MySQL, SQL Server, Oracle, or any other. As I am doing this as a personal project, I chose MySQL.
The key factor is the frequency of job scheduling, and this is the task of the scheduler. Is this assumption correct? I mean, is the key to creating a real-time DWH to schedule jobs every second for every ETL process?
If I am wrong, can you help me understand this? And then, what is the way to create a real-time DWH? Is there any open-source scheduler that allows this? And any non-open-source scheduler that allows it?
I am very confused because some references say that this is impossible, while others say it is possible.
Definition
Very interesting question. First of all, you should define how "real-time" real time needs to be. Real time means very low latency for incoming data, but it requires good architecture in the sending systems, maybe an event bus or messaging queue, and good infrastructure on the receiving end. This usually involves some kind of listener and pushing from the delivering systems.
Near real time would be the next "lower" level. If we say near real time means about 5 minutes of delay at most, your approach could work as well. For example, you could pull the data every minute or so. But keep in mind that you need some kind of high-performance check to determine whether new data is available and which records to fetch. If this check and the pull take longer than a minute, it becomes harder to keep up with the data. It really depends on the volume.
Realtime
As I said before, real-time analytics ideally require a messaging queue or a service bus that some of your jobs can connect to and "listen" on for new data. When a new data package is pushed into the pipeline, it will probably be very small and can be processed very quickly.
If there is no infrastructure for listeners, you need to go near real time.
Near-realtime
This is the part where you have to develop more. You have to make sure you get relatively small data packages, which will usually be some kind of delta. This could be done with triggers if you have access to the source database. Otherwise you have to pull every once in a while, where your "once in a while" will probably be very frequent.
This could be done on Linux, for example, with a simple cron job, or on Windows with the Task Scheduler. Just keep in mind that your loading and processing time shouldn't exceed the time window you have until the next job starts.
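As a minimal sketch of such a frequent delta pull over plain JDBC, assuming a source table events with an auto-incrementing id column used as the watermark (table, column, and connection details are made up for the example):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class DeltaPull {

    /**
     * Pulls only rows newer than the last loaded id and returns the new
     * high-water mark, so the next run fetches only the next delta.
     */
    public static long pullDelta(long lastLoadedId) throws Exception {
        String url = "jdbc:mysql://source-host:3306/app";
        String sql = "SELECT id, payload FROM events WHERE id > ? ORDER BY id";

        long newWatermark = lastLoadedId;
        try (Connection conn = DriverManager.getConnection(url, "etl", "secret");
             PreparedStatement ps = conn.prepareStatement(sql)) {

            ps.setLong(1, lastLoadedId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    long id = rs.getLong("id");
                    String payload = rs.getString("payload");
                    // ... transform and insert into the DWH staging table ...
                    newWatermark = id;
                }
            }
        }
        return newWatermark;
    }
}
```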
Database
In the end, once you have defined what you want to achieve and have a general idea of how to implement delta loading or listeners, you are right: you could use a relational database. If you are interested in performance and are modelling this part as a star schema, you could also look into column-based engines or column-oriented databases like Apache Cassandra.
Scheduling
For job scheduling, you could also start with the standard Linux or Windows scheduling tools. If you code in Java, you could later use something like Quartz. But this only applies to near real time; real time requires a different architecture, as I explained above.
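A minimal Quartz sketch for the near-real-time case, firing a delta-load job every minute; the class and job names are illustrative and the job body is just a placeholder for your own delta-load logic:

```java
import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.SimpleScheduleBuilder;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

public class NearRealtimeEtl {

    // The job body would call your delta-load logic (e.g. the JDBC pull above).
    public static class DeltaLoadJob implements Job {
        @Override
        public void execute(JobExecutionContext context) {
            System.out.println("Running delta load at " + new java.util.Date());
        }
    }

    public static void main(String[] args) throws Exception {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();

        JobDetail job = JobBuilder.newJob(DeltaLoadJob.class)
                .withIdentity("deltaLoad", "dwh")
                .build();

        // Fire every 60 seconds; keep the job's runtime well under this window.
        Trigger trigger = TriggerBuilder.newTrigger()
                .withIdentity("everyMinute", "dwh")
                .withSchedule(SimpleScheduleBuilder.simpleSchedule()
                        .withIntervalInSeconds(60)
                        .repeatForever())
                .build();

        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }
}
```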

Reasons against using Elasticsearch as an OLAP cube

At first glance, it seems that with Elasticsearch as a backend it is easy and fast to build reports with pivot-like functionality as used in traditional business intelligence environments.
By "pivot-like" I mean that in SQL-terms, data is grouped by one to two dimensions, filtered, ordered by one or two dimensions and aggregated by several metrics e.g. with sum or count.
By "easy" I mean that with a sufficiently large cluster, no pre-aggregation of the data is required, which saves ETLs and data engineering time.
By "fast" I mean that due to Elasticsearch's near real time capability report latency can be reduced in many instances, when compared to traditional business intelligence systems.
Are there any reasons, not to use Elasticsearch for the above purpose?
Elasticsearch is a great alternative to a cube; we use it for that same purpose today. One huge benefit is that with a cube you need to know in advance which dimensions you want to report on. With ES you just shove in more and more data and figure out later how you want to report on it.
At our company we regularly have data go through the following life cycle.
record is written to SQL
primary key from SQL is written to RabbitMQ
we respond back to the customer very quickly
When Rabbit has time, it uses the primary key to gather up all the data we want to report on
That data is written to ElasticSearch
A word of advice: if you think you might want to report on it, capture it from the beginning. Inserting 1M rows into ES is very easy; updating 1M rows is a bigger pain.
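For reference, the pivot-like report described in the question (group by a dimension, sum a metric) maps naturally onto an Elasticsearch terms aggregation with a sum sub-aggregation. Here is a minimal sketch using the low-level Java REST client; the index and field names are assumptions for the example:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class PivotLikeQuery {

    public static void main(String[] args) throws Exception {
        // Low-level REST client: we just send the aggregation DSL as JSON.
        try (RestClient client = RestClient.builder(
                new HttpHost("localhost", 9200, "http")).build()) {

            // Group by "region" (one dimension) and sum "revenue" (one metric),
            // roughly: SELECT region, SUM(revenue) FROM sales GROUP BY region.
            Request request = new Request("POST", "/sales/_search");
            request.setJsonEntity("{"
                    + "\"size\": 0,"
                    + "\"aggs\": {"
                    + "  \"by_region\": {"
                    + "    \"terms\": { \"field\": \"region\" },"
                    + "    \"aggs\": {"
                    + "      \"total_revenue\": { \"sum\": { \"field\": \"revenue\" } }"
                    + "    }"
                    + "  }"
                    + "}"
                    + "}");

            Response response = client.performRequest(request);
            System.out.println(response.getStatusLine());
        }
    }
}
```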

How to distribute data and computation to maximize locality?

Please bear with me, this is a basic architectural question for my first attempt at a "big data" project, but I believe your answers will be of general interest to anyone who is starting out in this field.
I've googled and read the high-level descriptions of Kafka, Storm, Memcached, MongoDB, etc., but now that I'm ready to dig in to start designing my app, I still need some further insight on how in fact the data should be distributed and shared.
The performance of my app is critical, so one objective is to somehow maximize the locality of the data in the RAM of the machines doing the distributed calculations. I need advice for this part of the design.
If my app had some clear criteria for a priori sharding the data and distributing the calculations (such as geographical regions or company divisions) then the solution would be obvious. But unfortunately my app's data access patterns are dynamic and depend on the results of previous calculations.
My app is an analysis program with distinct stages. In the first stage, all the data is accessed once and a metric is calculated for each data object. In the second stage, a subset of the data objects may be accessed, with the probability of access being proportional to each data object's metric that was calculated in the previous stage. In the final stage, a relatively small subset of data objects will be accessed many times for many calculations.
At all stages, it is required that the calculations be distributed across several servers. The calculations are embarrassingly parallel, and each distributed calculation only needs to access a few data objects. It is also required that the number of servers can be specified before the app runs (for example, run on one server, or run on fifty servers).
It seems to me that I need some mechanism that distributes the appropriate data objects to the appropriate compute servers, as opposed to just blindly fetching the data from some database service (whether centralized or distributed). Also, it seems to me that some sort of smart caching system might be appropriate, since the data access pattern depends on the previous calculations and cannot be predicted a priori. But as far as I can tell, Memcached is not such a system because the sharding is determined a priori.
I've read many times that the operating system cache performs better than any monkeying around that we may try. I think the ideal solution is that each compute server's RAM cache somehow captures the data objects' dynamic access patterns, but it's not clear to me how this would work with a NoSQL or Memcached service.
Thanks for bearing with me this far. I realize this is a basic question, but the answer eludes me so far. I can't resolve the dynamic access patterns of my app with the a priori sharding of the NoSQL/Memcached packages. Any advice would be greatly appreciated.
I recommend taking a look at http://tarantool.org. Shard to maximize locality for the most common data access pattern, use Lua for local computations, and use net.box to issue a remote RPC when a calculation needs to continue on another node. All data is stored in RAM, and if you write your computation code carefully it can take advantage of the just-in-time compiler.
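Independent of Tarantool itself, the sharding idea in this answer boils down to routing each calculation to the node that owns its data object. A toy Java sketch of that routing decision follows; the node list and key scheme are made up for illustration, and Tarantool's net.box would play the remote-RPC role:

```java
import java.util.List;

public class ShardRouter {

    private final List<String> nodes; // e.g. host:port of each compute server

    public ShardRouter(List<String> nodes) {
        this.nodes = nodes;
    }

    /** Deterministically maps a data object's id to the node that owns it. */
    public String nodeFor(long objectId) {
        int index = Math.floorMod(Long.hashCode(objectId), nodes.size());
        return nodes.get(index);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(
                List.of("node-a:3301", "node-b:3301", "node-c:3301"));

        // The caller runs the calculation locally if it owns the object,
        // otherwise it issues an RPC to the owning node.
        System.out.println("object 42 lives on " + router.nodeFor(42L));
    }
}
```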

Recommendation for a large-scale data warehousing system

I have a large amount of data I need to store, and be able to generate reports on - each one representing an event on a website (we're talking over 50 per second, so clearly older data will need to be aggregated).
I'm evaluating approaches to implementing this, obviously it needs to be reliable, and should be as easy to scale as possible. It should also be possible to generate reports from the data in a flexible and efficient way.
I'm hoping that some SOers have experience of such software and can make a recommendation, and/or point out the pitfalls.
Ideally I'd like to deploy this on EC2.
Wow. You are opening up a huge topic.
A few things right off the top of my head...
think carefully about your schema for inserts in the transactional part and reads in the reporting part; you may be best off keeping them separate if you have really large data volumes
look carefully at the latency you can tolerate between real-time reporting on your transactions and aggregated reporting on your historical data. Maybe you should have a process which runs periodically and aggregates your transactions (a rough sketch follows after this answer).
look carefully at any requirement which sees you reporting across your transactional and aggregated data, either in the same report or as a drill-down from one to the other
prototype with some meaningful queries and some realistic data volumes
get yourself a real production quality, enterprise ready database, i.e. Oracle / MSSQL
think about using someone else's code/product for the reporting e.g. Crystal/BO / Cognos
as I say, huge topic. As I think of more I'll continue adding to my list.
HTH and good luck
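As a rough sketch of the periodic aggregation idea from the list above, a rollup job might do little more than an INSERT ... SELECT from the transactional schema into the reporting schema via JDBC. The table names, columns, and Oracle-flavored date truncation below are invented for the example:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Timestamp;
import java.time.LocalDate;

public class HourlyRollupJob {

    public static void main(String[] args) throws Exception {
        String url = "jdbc:oracle:thin:@dwh-host:1521/dwh"; // or MSSQL, etc.

        // Aggregate yesterday's raw events into an hourly summary table.
        // TRUNC(..., 'HH24') is Oracle syntax; adjust for your DBMS.
        String rollup =
                "INSERT INTO reporting.events_hourly (event_hour, event_type, event_count) "
              + "SELECT TRUNC(event_time, 'HH24'), event_type, COUNT(*) "
              + "FROM transactional.events "
              + "WHERE event_time >= ? AND event_time < ? "
              + "GROUP BY TRUNC(event_time, 'HH24'), event_type";

        LocalDate yesterday = LocalDate.now().minusDays(1);
        try (Connection conn = DriverManager.getConnection(url, "etl", "secret");
             PreparedStatement ps = conn.prepareStatement(rollup)) {
            ps.setTimestamp(1, Timestamp.valueOf(yesterday.atStartOfDay()));
            ps.setTimestamp(2, Timestamp.valueOf(yesterday.plusDays(1).atStartOfDay()));
            int rows = ps.executeUpdate();
            System.out.println("Aggregated " + rows + " hourly rows");
        }
    }
}
```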
@Simon made a lot of excellent points; I'll just add a few and reiterate/emphasize some others:
Use the right datatype for the Timestamps - make sure the DBMS has the appropriate precision.
Consider queueing for the capture of events, allowing multiple threads/processes to handle the actual storage of the events (see the sketch after this list).
Separate the schemas for your transactional and data warehouse
Seriously consider a periodic ETL from transactional db to the data warehouse.
Remember that you probably won't have 50 transactions/second 24x7x365 - peak transactions vs. average transactions
Investigate partitioning tables in the DBMS. Oracle and MSSQL will both partition on a value (like date/time).
Have an archiving/data retention policy from the outset. Too many projects just start recording data with no plans in place to remove/archive it.
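Here is a toy sketch of the queueing idea from the list above, using a plain in-process BlockingQueue to decouple event capture from storage; in production this would more likely be an external queue, and the writers would batch their inserts:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class EventCapture {

    record WebEvent(long timestampMillis, String type, String payload) {}

    public static void main(String[] args) {
        // Bounded queue: request threads enqueue cheaply and return immediately.
        BlockingQueue<WebEvent> queue = new ArrayBlockingQueue<>(10_000);

        // A pool of writer threads drains the queue and persists events,
        // so bursts above 50 events/sec are absorbed by the buffer.
        for (int i = 0; i < 4; i++) {
            Thread writer = new Thread(() -> {
                try {
                    while (true) {
                        WebEvent event = queue.take();
                        // ... batch and insert into the transactional schema ...
                        System.out.println("stored " + event.type());
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            writer.setDaemon(true);
            writer.start();
        }

        // Capture side: called from the web tier for every tracked event.
        queue.offer(new WebEvent(System.currentTimeMillis(), "page_view", "{...}"));
    }
}
```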
I'm surprised none of the answers here cover Hadoop and HDFS; I would suggest that is because SO is a programmers' Q&A site and your question is in fact a data science question.
If you're dealing with a large number of queries and long processing times, you could use HDFS (a distributed storage system) to store your data and run batch queries (i.e. analytics) on commodity hardware.
You would then provision as many EC2 instances as needed (hundreds or thousands, depending on how big your data-crunching requirements are) and run MapReduce queries against your data to produce reports.
Wow.. This is a huge topic.
Let me begin with databases. First, get something good if you are going to have crazy amounts of data. I like Oracle and Teradata.
Second, there is a definitive difference between recording transactional data and reporting/analytics. Put your transactional data in one area and then roll it up on a regular schedule into a reporting area (schema).
I believe you can approach this in two ways:
Throw money at the problem: Buy best in class software (databases, reporting software) and hire a few slick tech people to help
Take the homegrown approach: build only what you need right now and grow the whole thing organically. Start with a simple database and build a web reporting framework. There are a lot of decent open-source tools and inexpensive agencies that do this work.
As for the EC2 approach, I'm not sure how it would fit into a data storage strategy. The processing requirement is limited, which is where EC2 is strong; your primary goal is efficient storage and retrieval.
