How do I ensure consistency of aggregates with high availability? - caching

My team needs to find a solution to the following problem:
Our application allows users to view total sales for the enterprise, totals by product, totals by region, totals by region x product, totals by regions x division, etc. You get the idea. There are so many values that need to be aggregated to get many of those totals that they cannot be computed on the fly - we have to pre-aggregate them to provide decent response times, a process that takes about 5 minutes.
The problem, which we thought was a common one but can find no references to, is how to allow updates to various sales without shutting off the users. Also, the users cannot accept eventual consistency - if they drill down on a total of 12 they better see numbers that add up to 12. So we need Consistency + Availability.
The best solution we've come up with so far is to direct all queries to a redundant database, "B" (optimized for queries) while updates are directed to the primary database, "A". When we decide to spend the 5 minutes to update all the aggregates, we update database "C", which is yet another redundant database just like "B". Then, new user sessions get directed to "C", while existing user sessions continue to use "B". Eventually, warning anyone left using "B", we kill the sessions on "B" and re-aggregate there, swapping the roles of "B" and "C". Typical drain-stop scenario.
We are surprised that we cannot find any discussion of this and are concerned that we are over-engineering this problem or maybe it's not the problem we think it is. Any advice is greately appreciated.

This was an interesting problem so I thought about it on the train, and I came up with the idea of storing a timestamp for each row in the database that you aggregate over. (I think this technique has a name, but it escapes me and googling isn't finding it...)
The timestamp would indicate when this row was inserted. In addition:
-If rows can be updated, then you will have two 'versions' of the row at once, one more recent than the other.
-If rows can be deleted, then there will need to be a 'deleted version' row that specifies when it was deleted.
Now you can do things such as:
1) Say you update the aggregates at Jan 1 2000 midnight. You can have views of the table return the table's data as though it was Jan 1 2000 midnight, ignoring all inserts/updates/deletes more recent than that. Now the aggregates are as up to date as the data in the view AND you can keep adding data to the underlying table.
2) I don't know how feasible/easy to guarantee it's reliable this would be, but you could have 'differentially computed aggregates' where on Jan 2 2000 midnight, you take the aggregates of Jan 1 2000 midnight and update them only with the data that has been changed since that time - saving you from recomputing so much historical data. (Of course, it gets hairier once you consider rows being updated or deleted that are older than 24 hours)
3) Whenever you bring your aggregates up to date, you can merge updated and deleted rows with their older version and get rid of the older version, so you only have to keep duplicates of rows around when you need them to separate rows that have been aggregated and rows that aren't (this also means that, for instance, if all your aggregates run at once, and you update a row three times in quick succession, you only need to keep the most recent update-indicating row)

If updates cannot be computed on the fly, then caching of results sets as you are doing in another database helps solve the issue of availability with faster response times.
For consistency, you may be able to make use of some form of transaction isolation. For example, MySQL supports a number of different transaction levels, of which REPEATABLE READ may go close to providing you with some consistency in a single transaction. If a transaction can be left open for multiple requests as the users drill down to see the data, they effectively see a snapshot of the database state as of the first request.
In a more generic sense, you're just after a handle which to the data which is provided by the client to indicate a consistent set. As in Patashu's answer, the handle for a client requesting a set of aggregates could be time based. The first stage of client interaction would be to get a handle to the latest aggregate data, eg the current time. If would then pass that handle with each request. As requests are made of the server, it uses the handle to determine which set of aggregate data to return. Rather than having both server "B" and "C", all aggregate data could be stored in server "B", with all aggregate data containing the handle information. This then allows requests to a single server for aggregate data both new and old. At some point, old aggregate data could be purged from "B".
Perhaps a search on transaction isolation will turn up more results for discussion on consistency.

I think you're looking for Data Warehousing concepts
In computing, a data warehouse or enterprise data warehouse (DW, DWH,
or EDW) is a database used for reporting and data analysis. It is a
central repository of data which is created by integrating data from
one or more disparate sources. Data warehouses store current as well
as historical data and are used for creating trending reports for
senior management reporting such as annual and quarterly comparisons.
...
Unlike the ETL-based data warehouse, the integrated source data
systems and the data warehouse are all integrated since there is no
transformation of dimensional or reference data. This integrated data
warehouse architecture supports the drill down from the aggregate data
of the data warehouse to the transactional data of the integrated
source data systems.

Related

Oracle - Frequently accessed tables

We're running Oracle 12c SE. I've read a lot of postings that say v$segment_statistics may have information on which are the most frequently queried or updated objects. However, can that be broken down? Say that one might want to see during what times of the day certain objects are hotter than other objects, or perhaps number of physical reads or writes per hour for a given table?
Does Oracle SE offer this an any of the v$ views?
It sounds like you are describing dba_hist_seg_stat which is one of the tables that is populated as part of the Automatic Workload Repository (AWR). If you're on the standard edition, I don't believe querying these views would violate your license agreement but I don't keep up to date with changes to licensing terms particularly for the standard edition.
You could replicate this functionality yourself by putting together a job that runs every few minutes, queries v$segment_statistics, and writes the delta from the prior snap to a custom table. You could then query that table to see what activity was going on at different points in time.

Business Data Preparation for Reporting and BI

I am doing some research about what the best possible state that data should be in so that reporting and BI analytics perform well but can be produced by business users from a set of various data collections which align with a business data glossary that I have worked through.
We have not chosen a specific BI tool but have been playing around with Power BI and Sisense
We have not decided on a data store technology to use for reporting purposes
Origin Data
Our business application that the data will originate from has a normalised SQL relational database. There are quite a few tables and joins to consider which work fine from an application perspective but I have recommended supplying the output of those queries as a flat denormalised set of data to increase redundancy and remove the joins entirely.
Business Data Glossary
As we go through defining the business data glossary, the number of columns increases but I do not anticipate there being any more than 100 columns per row as a complete reporting set of data. I wanted to ensure that each row of data is at a transactional depth (level 0) and that the roll up through the data would be done through aggregations by distinct key values and dimensional taxonomy.
Architecture
I want some advice around what a modern architecture looks like and what works for business users rather than users who are comfortable with SQL queries and a myriad of joins on a physical data model.
I read an article about setting up data flows for Power BI which looked like they type of thing I want to do from a data availability perspective but it doesn't advice on how the data should be stored and what type of database to use.
Data Sets
The data we have that needs to be reported on are transactions where level 0 is trade positions (individual transactions from either a local or counterparty entity), level 1 is reconciliations (relating local and counterparty entities and trade linking identifier) and level 2 would be where it can be rolled up by taxonomy like asset type or status.
The current data set size would be a snapshot of positions every business day so, its duplicated every day with a snapshot date applied. The reports would be able to move across dates and show changes over time.
Any advice would be greatly appreciated on how to tackle reporting and BI in 2020. Oooh, one last thing, there is the possibility that we won't be allowed to process this type of data in the public cloud, we have our own infrastructure which is on private cloud so, that might need to be a consideration. Thanks

Oracle 11g - Building a Type 2 SCD based on existing historical data in a relational model

I'm an ETL developer that's currently being tasked with developing a type 2 SCD from existing historical data in a relational database. I'm perfectly capable of creating a type 2 SCD that's responsible for tracking future changes to the data, but I'm completely useless when it comes to the task at hand.
The relational model is in our ODS . Based on that relational model, I'm supposed to build flat records in our DW dimension. There are multiple attributes which need to be monitored for changes, each in specific related tables in the relational model. Historical changes must be kept on a daily basis, and if multiple changes to the same attribute occur on the same day, only the last subsists.
How can I tackle this? I'm lost. Thanks in advance.
P.S. we're talking tables with 20-30 million rows and multiple attributes that may change at any given time and therefore must result in a new record in the SCD.
This will indeed be painful. I'm assuming from your question that the tables containing the attribute values are currently varying independently (or you wouldn't need to ask the question).
If you have a table 'Table1' containing 'Key', 'Attribute1' and 'Effective From','Effective To' columns, then you can 'explode' that table into a virtual table in the form 'Key','Attribute1','Date', projecting out one row for every date where that attribute was current.
(Note that you probably don't want to do this as a ranged join against your date dimension, because this will be a Triangular Join (ie perform really badly), you probably need to explode the rows in an ETL tool/programmatically)
If you perform this process across multiple tables, you will have a set of tables giving you the full day-by-day snapshot of each attribute for every day that you care about. It's then fairly easy to join those tables based on 'FK' and 'Date' to give you the complete daily snapshot across all of the attribute values.
Then, of course, you need to run this though another process to collapse rows with the same Key, contiguous dates and all the same attribute values, ie 'unexplode' the rows, back into 'effective from','effective to' form. Note again, that this is fundamentally a row-by-row operation (or at very least a windowing function), and a set-based approach will perform very badly. Personally I'd just stream it all though some .net/java code to achieve this.
Given data volumes this will take a while, but should be achievable.

Simulating server-side group and sort in Azure table storage

I have a table to which I add records whenever the user views a particular resource. The key fields are
Username
Resource
Date Viewed
On a history page of my app, I want to present a set number (e.g., top 5) of the user's most recently viewed Resources, but I want to group by Resource, so that if some were viewed several times, only the most recent of each one is shown.
To be clear, if the raw data looked like this:
UserA | ResourceA | Jan 1
UserA | ResourceA | Jan 2
UserA | ResourceB | Jan 3
UserA | ResourceA | Jan 4
...
...only the bottom two records would appear in the history page.
I know you can get server-side chronological sorting by using a string derived from the date in the PartitionKey or RowKey fields.
I also see that you could enable a crude grouping mechanism by using Username and Resource as your PartitionKey and RowKey fields, and then using Insert-or-update, to maintain a table in which you kept pointers for the most recent value for each combination. However, those records wouldn't be sorted chronologically.
Is there any way to design a set of tables so that I can get the data I need without retrieving tons of extra entities and sorting on the client? I'm willing to get elaborate with the design if that's what it takes. Thanks in advance!
First, I would strongly recommend that you read this excellent Azure Storage Table Design Guide: Designing Scalable and Performant Tables document from Storage team.
Yes, I would agree that it is somewhat tricky with Azure Table Storage but it is doable :).
What you have to do is keep multiple copies of the same data. Each copy will serve a different purpose.
Considering the scenario where you want to fetch most recent lines for Resource A and B, here's what your entity structure would look like:
PartitionKey: Date/Time (in Ticks) reversed i.e. DateTime.MaxValue.Ticks - LastAccessedDateTime.Ticks. Reverse ticks is required to that most recent entries will show up on the top of the table.
RowKey: Resource name.
AccessDate: Indicates the last access date/time.
User: Name of the user who accessed that resource.
So when you are interested in just finding out most recently used resources, you could start fetching records from the top.
In short, your data storage approach should be primarily governed by how you want to fetch the data. It would even mean you will have to save the same data multiple times.
UPDATE
As discussed in the comments below, Table Service doesn't directly support Server Side Grouping. This is something that you would need to do on your own. What you could do is create a separate table to store the access counts. As and when the resources are accessed, you basically either insert a new record in that table or update the count for that resource in that table.
Assuming you're always interested in finding out resource access count within a date/time range, here's what your entity structure would look like:
PartitionKey: Date/Time (in Ticks). The precision would depend on your reporting requirement. For example, if you want to maintain access counts by day then your precision would be a day.
RowKey: Resource name.
AccessCount: This field will constantly update as and when a resource is accessed.
LastAccessDateTime: This field will denote when a resource was last accessed.
For updating access counts, I would recommend that you make use of a background process. Basically in this approach, as a resource is accessed you add a message in a queue. This message will have resource name and date/time resource was last accessed. Then have a background process poll this queue and fetch messages. As the messages are received, you first get the current count and last access date/time for that resource. If no records are found, you simply insert a record in this table with count as 1. If a record is found then you compare the date/time from the table with the date/time sent in the message. If the date/time from the table is smaller than the date/time sent in the message, you update both count (increase that by 1) and last access date/time. If the date/time from the table is more than the date/time sent in the message, you only update the count.
Now to find most accessed resources in a time span, you simply query this table. Assuming there are limited number of resources (say in 100s), you can get this information from the table with at least 1 request. Since you're dealing with small amount of data, you can simply download this data on the client side and order it anyway you see fit. However to see the access details for a particular resource, you would have to fetch detailed data (1000 entities at a time).
Part of your brain might still be unconsciously trapped in relational-table design paradigms, I'm still getting to grips with that issue myself.
Rather than think of table storage as a database table (with the "query-ability" that goes with it) try visualizing it in more simple (dumb) terms.
A design problem I'm working on now is storing financial transaction data, and I want to know what the total $ amount of these transactions are. Because Azure table storage doesn't (yet?) offer aggregate functions I can't simply go .Sum(). To get around that I'm going to:
Sum the values of the transactions in my app before I pass them to azure.
I'll then pass that the result of the sum into azure as a separate piece of information, called RunningTotal.
Later on I can just return RunningTotal rather than pulling down all the transactions, and I can repeat the process by increment the value of RunningTotal each time i get new transactions.
Of course there are risks to this but the app is a personal one so the risk level is low and manageable, at least as a proof-of-concept.
Perhaps you can use a similar approach for the design of your system: compute useful values in advance. I'll almost be using table storage as a long-term cache rather than a database.

database for enterprise level using oracle - normalization and duplication

I am developing an enterprise application with an Oracle backend. I am designing a core part of the DB architecture now and im having some questions on it.
First and most important thing is, most of my tables needs to preserve old data. For example
Consider a table with the fields
Contract No, Contract Name, Contract Person, Contract Email
I have a records like
12, xxx, yyy, xxx#zzz.ccc
and some one modifies it to
12, xxx, zzz, xxx#zzz.ccc
at any point of time i need to display the new record while still have copy of the old record.
So what i thought was to put a duplicate record of the old data and update the fields that was changed and have a flag to keep track of active records with something like "is active" as 1.
The downside is that this creates redundancy in the table and seems like a bad design. But any other model seems unnecessarily complex and this seems cleaner to me. Also i dont see any performance issues having a duplicate record too. So please let me know if this is ok or am i missing something here.
Some times where there is a one to many relationship my assumption is to have a mapping table where i map the multiple entity in individual records by repeating master ID and changing child ID in each record. Is this a right way to do it or is there a better way to do it.
Is there a book on database best practices.
Thanks.
The database im dealing with is Oracle 11g on a two node RAC cluster
Also i dont see any performance issues having a duplicate record too.
Assume you have a row that, over time, has 15 updates to it. If you don't store any temporal data (if you don't store different versions of the row), you end up storing one row. If you do store temporal data, you end up storing 15 rows.
You also need more indexes, because the id number is no longer sufficient to identify a single row.
If you have only relatively small tables, you probably won't see any performance difference. (There will be one, but it probably won't be noticeable to users.) But a table that has 10 million rows will perform differently than a table that has 150 million rows. (15 versions per row, times 10 million rows.)
Some times where there is a one to many relationship my assumption is
to have a mapping table where i map the multiple entity in individual
records by repeating master ID and changing child ID in each record.
Is this a right way to do it or is there a better way to do it.
You probably need to know which child rows belong to which parent rows. So you need more than a single master id for the key. The master id alone doesn't tell you which version of that row in the parent table applies to a given child row.
Is there a book on database best practices.
There are books on temporal databases. The first one that I know of is Snodgrass's Developing Time-Oriented Database Applications in SQL. It's available in several formats, and it's free. It's also kind of old, but the information in it is important to understand if you're going to be building a temporal database. Also, think about reading Date's book Temporal Data and the Relational Model.
Wikipedia has an article that summarizes the ideas behind temporal databases.
Is normalization completely mandatory.
That's a meaningless question. You will have different issues with tables normalized to 2NF than you'll have with tables normalized to 5NF or 6NF.
I would keep the old/history records in a separate table. Create an upd/del trigger to populate your audit/history table for you, and keep only the most current data in your main table.
See here for an example. Many other similar examples exists in SO.

Resources