Business Data Preparation for Reporting and BI

I am researching the best state for data to be in so that reporting and BI analytics perform well, while still letting business users build reports from a set of data collections that align with a business data glossary I have worked through.
We have not chosen a specific BI tool, but we have been experimenting with Power BI and Sisense.
We have not decided on a data store technology for reporting purposes.
Origin Data
Our business application, where the data originates, has a normalised relational SQL database. There are quite a few tables and joins to consider. These work fine from an application perspective, but I have recommended supplying the output of those queries as a flat, denormalised data set, accepting more redundancy in order to remove the joins entirely.
Business Data Glossary
As we define the business data glossary, the number of columns keeps increasing, but I do not anticipate more than 100 columns per row in the complete reporting data set. I want each row of data to be at transactional depth (level 0), with the roll-up through the data done by aggregating over distinct key values and dimensional taxonomy.
Architecture
I would like some advice on what a modern architecture looks like and what works for business users, rather than for users who are comfortable with SQL queries and a myriad of joins on a physical data model.
I read an article about setting up data flows for Power BI, which looked like the type of thing I want to do from a data availability perspective, but it doesn't advise on how the data should be stored or what type of database to use.
Data Sets
The data we have that needs to be reported on are transactions where level 0 is trade positions (individual transactions from either a local or counterparty entity), level 1 is reconciliations (relating local and counterparty entities and trade linking identifier) and level 2 would be where it can be rolled up by taxonomy like asset type or status.
The current data set would be a snapshot of positions taken every business day, so it is duplicated every day with a snapshot date applied. The reports would be able to move across dates and show changes over time.
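The roll-up described above, from level 0 rows to totals by snapshot date and taxonomy, can be sketched in a few lines. This is only an illustration: the column names (snapshot_date, asset_type, notional) and the sample rows are assumptions, not part of the real data model, and a BI tool or database would normally do this aggregation itself.

```python
from collections import defaultdict
from datetime import date

d1, d2 = date(2020, 3, 2), date(2020, 3, 3)

# Hypothetical level-0 rows: one trade position per snapshot date.
positions = [
    {"snapshot_date": d1, "trade_id": "T1", "asset_type": "Bond", "notional": 100.0},
    {"snapshot_date": d1, "trade_id": "T2", "asset_type": "Bond", "notional": 50.0},
    {"snapshot_date": d1, "trade_id": "T3", "asset_type": "Equity", "notional": 25.0},
    {"snapshot_date": d2, "trade_id": "T1", "asset_type": "Bond", "notional": 100.0},
    {"snapshot_date": d2, "trade_id": "T2", "asset_type": "Bond", "notional": 75.0},
    {"snapshot_date": d2, "trade_id": "T3", "asset_type": "Equity", "notional": 25.0},
]


def roll_up(rows, keys, measure="notional"):
    """Sum `measure` over every distinct combination of the given key columns."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[k] for k in keys)] += row[measure]
    return dict(totals)


# Level 2: totals by snapshot date and asset type.
by_date_asset = roll_up(positions, ["snapshot_date", "asset_type"])

# Change over time for one taxonomy value between two snapshots.
bond_change = by_date_asset[(d2, "Bond")] - by_date_asset[(d1, "Bond")]
```

Because every row carries its snapshot date, comparing two dates is just a lookup on two keys of the same aggregate.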
Any advice on how to tackle reporting and BI in 2020 would be greatly appreciated. One last thing: there is a possibility that we won't be allowed to process this type of data in the public cloud. We have our own infrastructure on a private cloud, so that might need to be a consideration. Thanks

Where in the stack to best merge analytical data-warehouse data with data scraped+cached from third-party APIs?

Background information
We sell an API to users, that analyzes and presents corporate financial-portfolio data derived from public records.
We have an "analytical data warehouse" that contains all the raw data used to calculate the financial portfolios. This data warehouse is fed by an ETL pipeline, and so isn't "owned" by our API server per se. (E.g. the API server only has read-only permissions to the analytical data warehouse; the schema migrations for the data in the data warehouse live alongside the ETL pipeline rather than alongside the API server; etc.)
We also have a small document store (actually a Redis instance with persistence configured) that is owned by the API layer. The API layer runs various jobs to write into this store, and then queries data back as needed. You can think of this store as a shared persistent cache of various bits of the API layer's in-memory state. The API layer stores things like API-key blacklists in here.
Problem statement
All our input data is denominated in USD, and our calculations occur in USD. However, we give our customers the query-time option to convert the response just-in-time to another currency. We do this by having the API layer run a background job to scrape exchange-rate data, and then cache it in the document store. Individual API-layer nodes then do (in-memory-cached-with-TTL) fetches from this exchange-rates key in the store, whenever a query result needs to be translated into a specific currency.
At first, we thought that this unit conversion wasn't really "about" our data, just about the API's UX, and so we thought this was entirely an API-layer concern, where it made sense to store the exchange-rates data into our document store.
(Also, we noticed that, by not pre-converting our DB results into a specific currency on the DB side, the calculated results of a query for a particular portfolio became more cache-friendly; the way we're doing things, we can cache and reuse the portfolio query results between queries, even if the queries want the results in different currencies.)
But recently we've been expanding into allowing partner clients to execute complex data-science/Business Intelligence queries directly against our analytical data warehouse. And it turns out that they will often need to do final exchange-rate conversions in their BI queries as well, despite there being no API layer involved here.
It seems like, to serve the needs of BI querying, the exchange-rate data "should" actually live in the analytical data warehouse alongside the financial data; and the ETL pipeline "should" be responsible for doing the API scraping required to fetch and feed in the exchange-rate data.
But this feels wrong: the exchange-rate data has a different lifecycle and different integrity constraints than our financial data. The exchange rates are dirty and ephemeral point-in-time samples obtained by scraping, whereas the financial data is a reliable historical event stream. The exchange rates get constantly updated/overwritten, while the financial data is append-only. Etc.
What is the best practice for serving the needs of analytical queries that need to access backend "application state" for "query result presentation" needs like this? Or am I wrong in thinking of this exchange-rate data as "application state" in the first place?
What I find interesting about your scenario is the question of when the exchange rate data is applicable.
In the case of the API, it's all about the realtime value in the other currency and it makes sense to have the most recent value in your API app scope (Redis).
However, I assume your analytical data warehouse has tables with purchases that were made at a certain time. In those cases, the current exchange rate is not really relevant to the value of the transaction.
This might mean that you want to store the exchange rate history in your warehouse or expand the "purchases" table to store the values in all the currencies at that moment.
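If the warehouse keeps an exchange-rate history, a BI query can convert each amount using the rate that was in effect at the time of the transaction. Below is a minimal sketch of that "as of" lookup. The rate values, the USD-to-EUR pair, and the function names are made up for illustration; in a real warehouse this would more likely be a join against a rates table.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical USD->EUR samples, sorted by the time each rate was scraped.
rate_history = [
    (datetime(2020, 1, 1), 0.89),
    (datetime(2020, 2, 1), 0.90),
    (datetime(2020, 3, 1), 0.91),
]


def rate_as_of(history, when):
    """Return the most recent rate sampled at or before `when`."""
    times = [sampled_at for sampled_at, _ in history]
    i = bisect_right(times, when)
    if i == 0:
        raise LookupError(f"no rate sampled at or before {when}")
    return history[i - 1][1]


def convert(amount_usd, when, history=rate_history):
    """Convert a USD amount using the rate that applied at time `when`."""
    return amount_usd * rate_as_of(history, when)


# A transaction dated mid-February uses the 1 February sample.
eur_value = convert(100.0, datetime(2020, 2, 15))
```

Keeping the history append-only (one row per sample) also sidesteps the lifecycle mismatch raised in the question, since nothing in the warehouse is overwritten.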

Advice on Setup

I started my first data analysis job a few months ago. I am in charge of a SQL database and of turning its data into dashboards in Power BI. Our SQL database is replicated from an online web portal we use for data entry. We do not add data to the database ourselves; instead, the tables are populated from what is entered into the web portal. Since this database is replicated by another company, I created our own database that connects to it via a linked server. I have built many views to pull only the needed data from the initial database (I did this to limit the amount of data sent to Power BI, for performance). My view count is climbing, and I am wondering whether, in terms of performance, this is the best way forward. The largest view has 32,000 rows and the smallest has around 1,000.
Some of the views that I am writing end up joining 5-6 tables together due to the structure built by the data web portal company that controls the database.
My suggestion would be to create a data warehouse schema (star schema), keeping as a principle one star schema per domain: for example, one for sales, one for subscriptions, one for purchases, etc. Use the logic of data marts.
Identify your dimensions and your facts and keep evolving that schema. You will find that you end up with far fewer tables.
Your data is not that big, so you can use whatever ETL strategy you like: truncate-and-load or incremental.
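The two ETL strategies mentioned above differ only in how much of the target table is rewritten on each run. Here is a rough sketch using an in-memory SQLite database as a stand-in for the real target. The table fact_sales, its columns, and the updated_at watermark are all hypothetical, and the upsert syntax would differ on SQL Server.

```python
import sqlite3


def truncate_load(conn, rows):
    """Wipe the target table and reload every source row."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM fact_sales")
        conn.executemany(
            "INSERT INTO fact_sales (id, amount, updated_at) VALUES (?, ?, ?)", rows
        )


def incremental_load(conn, rows, last_loaded):
    """Upsert only rows changed after the previous watermark; return the new one."""
    changed = [r for r in rows if r[2] > last_loaded]
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO fact_sales (id, amount, updated_at) VALUES (?, ?, ?)",
            changed,
        )
    return max((r[2] for r in changed), default=last_loaded)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_sales (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)"
)

# First run: full load.
truncate_load(conn, [(1, 10.0, "2020-01-01"), (2, 20.0, "2020-01-02")])

# Later run: row 2 changed and row 3 is new, so only those two are written.
source = [(1, 10.0, "2020-01-01"), (2, 25.0, "2020-01-03"), (3, 5.0, "2020-01-03")]
watermark = incremental_load(conn, source, "2020-01-02")
```

With row counts in the tens of thousands, the truncate-and-load variant is usually the simpler choice, since it needs no watermark and cannot drift from the source.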

Which is more efficient in HBase: multiple tables with the same structure, or a single table containing a large set of data?

I earlier built a project that stored daily data for a particular entity in an RDBMS by creating a separate table for each day and then storing that day's data in it.
Now I want to move my database from the RDBMS to HBase. My question is whether I should create a single table and store the data for all days in it, or keep my earlier approach of creating an individual table for each day. I want to compare both cases on the basis of HBase performance.
Sorry if the question seems foolish. Thank you.
As you mentioned there are 2 options
Option 1: Single table of all days data
Option 2: multiple tables
I would prefer option 2 combined with namespaces (a very important feature, introduced in version 0.96) if you have a huge amount of data for a single day. This also supports multi-tenancy requirements.
See the HBase book:
A namespace is a logical grouping of tables analogous to a database in relational database systems. This abstraction lays the groundwork for upcoming multi-tenancy related features:
-Quota Management (HBASE-8410): Restrict the amount of resources (i.e. regions, tables) a namespace can consume.
-Namespace Security Administration (HBASE-9206): Provide another level of security administration for tenants.
-Region server groups (HBASE-6721): A namespace/table can be pinned onto a subset of RegionServers, thus guaranteeing a coarse level of isolation.
Below are the shell commands related to namespaces:
alter_namespace, create_namespace, describe_namespace,
drop_namespace, list_namespace, list_namespace_tables
Advantages:
Even if you use column filters, each per-day table holds less data, so a full table scan will be faster than with the single-table approach (a full scan on a big table is costly).
If you want authentication and authorization on a specific table, that can also be achieved.
Limitation: you will end up with multiple scripts to manage the tables rather than a single script (option 1).
Note: with either of the options above, your rowkey design is very important for performance and for preventing hotspotting.
For more details look at hbase-series
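To illustrate the rowkey point: one common way to avoid hotspotting is to prefix the key with a small hash-derived "salt" so that writes spread across regions, while keeping the entity and day next so that one entity's days stay contiguous and sorted. The sketch below only builds the key bytes and does not talk to HBase. The bucket count, the field order, and the make_rowkey helper are assumptions for illustration, not a recommendation from the answer above.

```python
import hashlib

NUM_BUCKETS = 16  # assumption: roughly the number of regions to spread writes over


def make_rowkey(entity_id: str, day: str, seq: int) -> bytes:
    """Build a salted rowkey of the form <bucket>|<entity>|<yyyymmdd>|<seq>."""
    # The bucket is derived from the entity id, so it is stable per entity
    # and a reader can recompute it instead of scanning every bucket.
    digest = hashlib.md5(entity_id.encode("utf-8")).digest()
    bucket = digest[0] % NUM_BUCKETS
    return f"{bucket:02d}|{entity_id}|{day}|{seq:010d}".encode("utf-8")


key_day1 = make_rowkey("entity-42", "20200101", 1)
key_day2 = make_rowkey("entity-42", "20200102", 0)
```

With keys like these, a single table (option 1) can still serve "one entity, one day" reads as a short prefix scan rather than a full table scan.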

high volume data storage and processing

I am building a new application where I expect a high volume of geolocation data: something like a moving object sending its geo coordinates every 5 seconds. This data needs to be stored in a database so that it can be used to track the moving object on a map at any time. I am expecting about 250 coordinates per moving object per route. Each object can run about 50 routes a day, and I have 900 such objects to track. So that comes to about 11.5 million geo coordinates to store per day. I have to keep at least one week of data in my database.
This data will basically be used for simple queries, like finding all the geo coordinates for a particular object and a particular route. So the queries are not very complicated, and this data will not be used for any analysis.
So my question is: should I just go with a normal Oracle database such as 12c, distributed over two VMs, or should I think about big data technologies like NoSQL or Hadoop?
One of the key requirements is high performance. Each query has to respond within 1 second.
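The volume estimate in the question can be checked with a few lines of arithmetic. The inputs are the figures quoted above; the average insert rate additionally assumes that writes are spread evenly over 24 hours, which may not hold in practice. Note that the product comes to 11.25 million rows per day, slightly under the rounded figure quoted.

```python
# Figures taken from the question above.
coords_per_route = 250
routes_per_object_per_day = 50
objects = 900
retention_days = 7

rows_per_day = coords_per_route * routes_per_object_per_day * objects
rows_retained = rows_per_day * retention_days

# Assumption: inserts arrive evenly across the whole day.
avg_inserts_per_second = rows_per_day / (24 * 60 * 60)

print(f"rows per day:         {rows_per_day:,}")
print(f"rows kept for a week: {rows_retained:,}")
print(f"average inserts/sec:  {avg_inserts_per_second:.0f}")
```

A retained set in the tens of millions of rows with a simple (object, route) lookup is a size that a conventional indexed or partitioned relational table can be load-tested against directly.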
Since you know the volume of data (11.5 million rows per day), you can easily simulate your whole scenario in an Oracle database and test it thoroughly beforehand.
My suggestion is to use day-level partitions, sub-partitioned by object and route. All your business SQL must always hit the right partitions.
You might also need to clear out older days' data. Alternatively, building some sort of aggregation over past days and then deleting the raw data would help.
It is quite doable in 12c.

How do I ensure consistency of aggregates with high availability?

My team needs to find a solution to the following problem:
Our application allows users to view total sales for the enterprise, totals by product, totals by region, totals by region x product, totals by regions x division, etc. You get the idea. There are so many values that need to be aggregated to get many of those totals that they cannot be computed on the fly - we have to pre-aggregate them to provide decent response times, a process that takes about 5 minutes.
The problem, which we thought was a common one but can find no references to, is how to allow updates to various sales without shutting off the users. Also, the users cannot accept eventual consistency - if they drill down on a total of 12 they better see numbers that add up to 12. So we need Consistency + Availability.
The best solution we've come up with so far is to direct all queries to a redundant database, "B" (optimized for queries) while updates are directed to the primary database, "A". When we decide to spend the 5 minutes to update all the aggregates, we update database "C", which is yet another redundant database just like "B". Then, new user sessions get directed to "C", while existing user sessions continue to use "B". Eventually, warning anyone left using "B", we kill the sessions on "B" and re-aggregate there, swapping the roles of "B" and "C". Typical drain-stop scenario.
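The drain-stop scheme described above can be sketched as a small session router. This models only the routing decision, not the databases themselves. The AggregateRouter class and its method names are hypothetical, and a real implementation would also need to warn users before draining, as the question notes.

```python
class AggregateRouter:
    """Pins each user session to one aggregate database until it is drained."""

    def __init__(self, active="B", standby="C"):
        self.active = active    # serves new sessions
        self.standby = standby  # being re-aggregated, or waiting to drain
        self.sessions = {}      # session id -> database name

    def database_for(self, session_id):
        """Existing sessions keep their database; new ones get the active one."""
        return self.sessions.setdefault(session_id, self.active)

    def publish_standby(self):
        """Call once the standby has fresh aggregates: swap the two roles."""
        self.active, self.standby = self.standby, self.active

    def drain(self):
        """End sessions still pinned to the old database and return their ids."""
        stale = [s for s, db in self.sessions.items() if db == self.standby]
        for s in stale:
            del self.sessions[s]
        return stale


router = AggregateRouter()
first = router.database_for("alice")      # pinned to B
router.publish_standby()                  # C now has the new aggregates
second = router.database_for("bob")       # new session goes to C
still_first = router.database_for("alice")  # alice keeps seeing B
drained = router.drain()                  # B is now free to re-aggregate
```

Each session sees one consistent set of totals for its whole lifetime, which is the property the users are asking for.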
We are surprised that we cannot find any discussion of this and are concerned that we are over-engineering the problem, or that maybe it's not the problem we think it is. Any advice is greatly appreciated.
This was an interesting problem so I thought about it on the train, and I came up with the idea of storing a timestamp for each row in the database that you aggregate over. (I think this technique has a name, but it escapes me and googling isn't finding it...)
The timestamp would indicate when this row was inserted. In addition:
-If rows can be updated, then you will have two 'versions' of the row at once, one more recent than the other.
-If rows can be deleted, then there will need to be a 'deleted version' row that specifies when it was deleted.
Now you can do things such as:
1) Say you update the aggregates at Jan 1 2000 midnight. You can have views of the table return the table's data as though it was Jan 1 2000 midnight, ignoring all inserts/updates/deletes more recent than that. Now the aggregates are as up to date as the data in the view AND you can keep adding data to the underlying table.
2) I don't know how feasible/easy to guarantee it's reliable this would be, but you could have 'differentially computed aggregates' where on Jan 2 2000 midnight, you take the aggregates of Jan 1 2000 midnight and update them only with the data that has been changed since that time - saving you from recomputing so much historical data. (Of course, it gets hairier once you consider rows being updated or deleted that are older than 24 hours)
3) Whenever you bring your aggregates up to date, you can merge updated and deleted rows with their older version and get rid of the older version, so you only have to keep duplicates of rows around when you need them to separate rows that have been aggregated and rows that aren't (this also means that, for instance, if all your aggregates run at once, and you update a row three times in quick succession, you only need to keep the most recent update-indicating row)
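The timestamped-row idea above can be sketched as follows: every insert, update, or delete appends a new version, and a view "as of" a cutoff keeps only the latest version of each row at or before that cutoff. The RowVersion type, the integer logical timestamps, and the use of None to mark a deleted version are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RowVersion:
    sale_id: int
    amount: Optional[float]  # None marks a 'deleted version' of the row
    written_at: int          # hypothetical logical timestamp


def as_of(versions, cutoff):
    """Return {sale_id: amount} as the table looked at time `cutoff`."""
    latest = {}
    for v in sorted(versions, key=lambda v: v.written_at):
        if v.written_at <= cutoff:
            latest[v.sale_id] = v  # later versions overwrite earlier ones
    return {k: v.amount for k, v in latest.items() if v.amount is not None}


versions = [
    RowVersion(1, 5.0, written_at=1),   # insert
    RowVersion(2, 7.0, written_at=2),   # insert
    RowVersion(1, 6.0, written_at=5),   # update after the aggregates ran
    RowVersion(2, None, written_at=6),  # delete after the aggregates ran
]

snapshot = as_of(versions, cutoff=3)  # what the aggregates were built from
total = sum(snapshot.values())
```

A drill-down that uses the same cutoff as the aggregate always adds up to the displayed total, even while newer versions keep arriving.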
If the aggregates cannot be computed on the fly, then caching result sets in another database, as you are doing, helps solve the availability issue with faster response times.
For consistency, you may be able to make use of some form of transaction isolation. For example, MySQL supports a number of different transaction isolation levels, of which REPEATABLE READ may come close to providing you with consistency within a single transaction. If a transaction can be left open across multiple requests as the users drill down into the data, they effectively see a snapshot of the database state as of the first request.
In a more generic sense, you're just after a handle to the data, provided by the client, that identifies a consistent set. As in Patashu's answer, the handle for a client requesting a set of aggregates could be time based. The first stage of client interaction would be to get a handle to the latest aggregate data, e.g. the current time. The client would then pass that handle with each request. As requests are made of the server, it uses the handle to determine which set of aggregate data to return. Rather than having both server "B" and server "C", all aggregate data could be stored in server "B", with every aggregate row carrying the handle information. This allows requests for both new and old aggregate data to go to a single server. At some point, old aggregate data could be purged from "B".
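The handle idea can be sketched with a small in-memory store that keeps several aggregate sets side by side, each tagged with the handle it was published under. The AggregateStore class, its method names, and the use of increasing integers as handles are hypothetical; in practice the handle would be a column on the aggregate tables in "B".

```python
class AggregateStore:
    """Keeps several versions of the aggregates, keyed by a handle."""

    def __init__(self):
        self._sets = {}      # handle -> {aggregate key: total}
        self._latest = None

    def publish(self, handle, aggregates):
        """Store a freshly computed aggregate set (handles assumed increasing)."""
        self._sets[handle] = dict(aggregates)
        self._latest = handle

    def latest_handle(self):
        """First call a client makes: which aggregate set is current?"""
        return self._latest

    def get(self, handle, key):
        """Every later request passes the handle, so answers stay consistent."""
        return self._sets[handle][key]

    def purge_older_than(self, handle):
        """Drop aggregate sets that no client should still be using."""
        for old in [h for h in self._sets if h < handle]:
            del self._sets[old]


store = AggregateStore()
store.publish(1, {"total": 12.0, "region:east": 5.0, "region:west": 7.0})
client_handle = store.latest_handle()

# A re-aggregation happens while the client is still drilling down.
store.publish(2, {"total": 13.0, "region:east": 6.0, "region:west": 7.0})

seen_total = store.get(client_handle, "total")
seen_parts = store.get(client_handle, "region:east") + store.get(client_handle, "region:west")
```

Compared with swapping whole databases, this keeps everything on one server at the cost of storing more than one copy of the aggregates until old handles are purged.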
Perhaps a search on transaction isolation will turn up more results for discussion on consistency.
I think you're looking for Data Warehousing concepts
In computing, a data warehouse or enterprise data warehouse (DW, DWH,
or EDW) is a database used for reporting and data analysis. It is a
central repository of data which is created by integrating data from
one or more disparate sources. Data warehouses store current as well
as historical data and are used for creating trending reports for
senior management reporting such as annual and quarterly comparisons.
...
Unlike the ETL-based data warehouse, the integrated source data
systems and the data warehouse are all integrated since there is no
transformation of dimensional or reference data. This integrated data
warehouse architecture supports the drill down from the aggregate data
of the data warehouse to the transactional data of the integrated
source data systems.
