Where in the stack to best merge analytical data-warehouse data with data scraped+cached from third-party APIs? - caching

Background information
We sell users an API that analyzes and presents corporate financial-portfolio data derived from public records.
We have an "analytical data warehouse" that contains all the raw data used to calculate the financial portfolios. This data warehouse is fed by an ETL pipeline, and so isn't "owned" by our API server per se. (E.g. the API server only has read-only permissions to the analytical data warehouse; the schema migrations for the data in the data warehouse live alongside the ETL pipeline rather than alongside the API server; etc.)
We also have a small document store (actually a Redis instance with persistence configured) that is owned by the API layer. The API layer runs various jobs to write into this store, and then queries data back as needed. You can think of this store as a shared persistent cache of various bits of the API layer's in-memory state. The API layer stores things like API-key blacklists in here.
Problem statement
All our input data is denominated in USD, and our calculations occur in USD. However, we give our customers the query-time option to convert the response just-in-time to another currency. We do this by having the API layer run a background job to scrape exchange-rate data, and then cache it in the document store. Individual API-layer nodes then do (in-memory-cached-with-TTL) fetches from this exchange-rates key in the store, whenever a query result needs to be translated into a specific currency.
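Roughly, the moving parts look like this (a simplified sketch, not our actual code; the endpoint URL, Redis key name, and TTL are placeholders, and it assumes the standard redis-py and requests clients):

```python
import json
import time

import redis
import requests

STORE = redis.Redis(host="localhost", port=6379, decode_responses=True)
RATES_KEY = "exchange_rates:usd"          # placeholder key name
RATES_URL = "https://example.com/rates"   # placeholder scrape target
LOCAL_TTL_SECONDS = 300

# --- background job, run periodically by the API layer ---
def refresh_exchange_rates():
    """Scrape the latest USD exchange rates and cache them in the document store."""
    rates = requests.get(RATES_URL, timeout=10).json()   # e.g. {"EUR": 0.92, "GBP": 0.79}
    STORE.set(RATES_KEY, json.dumps({"as_of": time.time(), "rates": rates}))

# --- per-node fetch, cached in memory with a TTL, used at query time ---
_local_cache = {"expires": 0.0, "rates": {}}

def get_rates():
    """Return the cached rates, refetching from the store when the local TTL lapses."""
    now = time.time()
    if now >= _local_cache["expires"]:
        payload = json.loads(STORE.get(RATES_KEY) or "{}")
        _local_cache["rates"] = payload.get("rates", {})
        _local_cache["expires"] = now + LOCAL_TTL_SECONDS
    return _local_cache["rates"]

def convert(amount_usd, currency):
    """Translate a USD query result into the requested currency just in time."""
    return amount_usd if currency == "USD" else amount_usd * get_rates()[currency]
```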
At first, we thought that this unit conversion wasn't really "about" our data, just about the API's UX, and so it seemed like an entirely API-layer concern where it made sense to keep the exchange-rates data in our document store.
(Also, we noticed that, by not pre-converting our DB results into a specific currency on the DB side, the calculated results of a query for a particular portfolio became more cache-friendly; the way we're doing things, we can cache and reuse the portfolio query results between queries, even if the queries want the results in different currencies.)
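Concretely, the result cache is keyed on the portfolio alone, and the conversion is applied as a final step (again a simplified sketch; run_warehouse_query is a stand-in for the real read-only warehouse query, and convert() is the helper sketched above):

```python
_result_cache = {}  # portfolio_id -> computed USD result (the real cache has TTL/eviction)

def run_warehouse_query(portfolio_id):
    """Placeholder for the expensive, read-only analytical query; returns USD figures."""
    raise NotImplementedError

def portfolio_value(portfolio_id, currency="USD"):
    # The expensive warehouse query is cached per portfolio, independent of currency...
    if portfolio_id not in _result_cache:
        _result_cache[portfolio_id] = run_warehouse_query(portfolio_id)
    # ...and the currency translation happens only at presentation time.
    return convert(_result_cache[portfolio_id], currency)
```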
But recently we've been expanding into allowing partner clients to execute complex data-science/Business Intelligence queries directly against our analytical data warehouse. And it turns out that they will also, often, need to do final exchange-rate conversions in their BI queries as well, despite there being no API layer involved here.
It seems like, to serve the needs of BI querying, the exchange-rate data "should" actually live in the analytical data warehouse alongside the financial data; and the ETL pipeline "should" be responsible for doing the API scraping required to fetch and feed in the exchange-rate data.
But this feels wrong: the exchange-rate data has a different lifecycle and different integrity constraints from our financial data. The exchange rates are dirty, ephemeral point-in-time samples obtained by scraping, whereas the financial data is a reliable historical event stream. The exchange rates get constantly updated/overwritten, while the financial data is append-only. Etc.
What is the best practice for serving the needs of analytical queries that need to access backend "application state" for "query result presentation" needs like this? Or am I wrong in thinking of this exchange-rate data as "application state" in the first place?

What I find interesting about your scenario is when the exchange-rate data is applicable.
In the case of the API, it's all about the real-time value in the other currency, and it makes sense to have the most recent value in your API app scope (Redis).
However, I assume your analytical data warehouse has tables with purchases that were made at a certain time. In those cases, the current exchange rate is not really relevant to the value of the transaction.
This might mean that you want to store the exchange rate history in your warehouse or expand the "purchases" table to store the values in all the currencies at that moment.
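For example, a point-in-time rate history joined on the transaction date might look something like this; it's only a sketch, all table and column names are made up, and sqlite is used here just to keep the example self-contained:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# A point-in-time rate history, fed by whatever scrapes the rates (names are illustrative).
con.executescript("""
CREATE TABLE exchange_rates (
    rate_date  TEXT NOT NULL,      -- date the rate was sampled
    currency   TEXT NOT NULL,      -- target currency code
    usd_rate   REAL NOT NULL,      -- units of currency per 1 USD
    PRIMARY KEY (rate_date, currency)
);
CREATE TABLE purchases (
    purchase_id   INTEGER PRIMARY KEY,
    purchase_date TEXT NOT NULL,
    amount_usd    REAL NOT NULL
);
""")

# BI queries can then convert each transaction at the rate in effect on its own date,
# rather than at today's rate.
query = """
SELECT p.purchase_id,
       p.amount_usd,
       p.amount_usd * r.usd_rate AS amount_eur
FROM purchases p
JOIN exchange_rates r
  ON r.rate_date = p.purchase_date AND r.currency = 'EUR';
"""
print(con.execute(query).fetchall())
```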

Related

Advice on Setup

I started my first data analysis job a few months ago and I am in charge of a SQL database, taking that data and creating dashboards within Power BI. Our SQL database is replicated from an online web portal we use for data entry. We do not add data ourselves to the database; instead, the data is put into tables based on what is entered into the web portal. Since this database is replicated via another company, I created our own database that is connected via linked server. I have built many views to pull only the needed data from the initial database (I did this to limit the amount of data sent to Power BI, for performance). My view count is climbing and I am wondering whether, in terms of performance, this is the best way forward. The highest row count of a view is 32,000 and the lowest is around 1,000 rows.
Some of the views that I am writing end up joining 5-6 tables together due to the structure built by the data web portal company that controls the database.
My suggestion would be to create a data-warehouse schema (star schema), keeping as a principle one star schema per domain: for example, one for sales, one for subscriptions, one for purchases, etc. Use the logic of data marts.
Identify your dimensions and your facts and keep evolving that schema. You will find that you end up with far fewer tables.
Your data are not that big, so you can use whatever ETL strategy you like: truncate-and-load or incremental.
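To make that concrete, a minimal star schema for one domain (sales) might look like the sketch below; table and column names are illustrative only, and sqlite is used just to keep the example runnable:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Dimensions: one row per descriptive entity.
CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, full_date TEXT, year INTEGER, month INTEGER);
CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT, region TEXT);

-- Fact: one row per sale, holding only keys and measures.
CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES dim_date(date_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    quantity     INTEGER,
    amount       REAL
);
""")

# A typical dashboard aggregation then becomes a simple join plus GROUP BY.
query = """
SELECT d.year, p.category, SUM(f.amount) AS revenue
FROM fact_sales f
JOIN dim_date d    ON d.date_key = f.date_key
JOIN dim_product p ON p.product_key = f.product_key
GROUP BY d.year, p.category;
"""
print(con.execute(query).fetchall())
```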

Business Data Preparation for Reporting and BI

I am doing some research into the best possible state for data to be in so that reporting and BI analytics perform well, while still being producible by business users from a set of various data collections which align with a business data glossary that I have worked through.
We have not chosen a specific BI tool but have been playing around with Power BI and Sisense
We have not decided on a data store technology to use for reporting purposes
Origin Data
Our business application that the data will originate from has a normalised SQL relational database. There are quite a few tables and joins to consider, which work fine from an application perspective, but I have recommended supplying the output of those queries as a flat, denormalised set of data, accepting the increased redundancy in order to remove the joins entirely.
Business Data Glossary
As we go through defining the business data glossary, the number of columns increases but I do not anticipate there being any more than 100 columns per row as a complete reporting set of data. I wanted to ensure that each row of data is at a transactional depth (level 0) and that the roll up through the data would be done through aggregations by distinct key values and dimensional taxonomy.
Architecture
I want some advice around what a modern architecture looks like and what works for business users rather than users who are comfortable with SQL queries and a myriad of joins on a physical data model.
I read an article about setting up dataflows for Power BI, which looked like the type of thing I want to do from a data-availability perspective, but it doesn't advise on how the data should be stored or what type of database to use.
Data Sets
The data we have that needs to be reported on are transactions where level 0 is trade positions (individual transactions from either a local or counterparty entity), level 1 is reconciliations (relating local and counterparty entities and trade linking identifier) and level 2 would be where it can be rolled up by taxonomy like asset type or status.
The current data set size would be a snapshot of positions every business day, so it's duplicated every day with a snapshot date applied. The reports would be able to move across dates and show changes over time.
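In other words, something like a daily snapshot fact keyed by snapshot date, which the reports then roll up by taxonomy. A rough sketch (all names are placeholders, and sqlite is only there to make it self-contained):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Level 0: one row per trade position, duplicated per business day with a snapshot date.
CREATE TABLE position_snapshots (
    snapshot_date  TEXT NOT NULL,
    trade_link_id  TEXT NOT NULL,   -- links local and counterparty legs (level 1)
    entity_side    TEXT NOT NULL,   -- 'local' or 'counterparty'
    asset_type     TEXT NOT NULL,
    status         TEXT NOT NULL,
    notional       REAL NOT NULL
);
""")

# Level 2: roll up a single day's snapshot by taxonomy (asset type and status here).
rollup = """
SELECT snapshot_date, asset_type, status, SUM(notional) AS total_notional
FROM position_snapshots
WHERE snapshot_date = '2020-06-30'
GROUP BY snapshot_date, asset_type, status;
"""
print(con.execute(rollup).fetchall())
```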
Any advice on how to tackle reporting and BI in 2020 would be greatly appreciated. Oooh, one last thing: there is the possibility that we won't be allowed to process this type of data in the public cloud. We have our own infrastructure, which is on a private cloud, so that might need to be a consideration. Thanks

Oracle Materialized View for sensory data transfer

In an application we have to send a sensory data stream from multiple clients to a central server over the internet. One obvious solution is to use MOM (message-oriented middleware) such as Kafka, but I recently learned that we can do this with database synchronization tools such as Oracle materialized views.
The latter approach works in some applications (sending data from a central server to multiple clients, the inverse direction of our application), but what are the pros and cons of it in our application? Which one is better for sending a sensory data stream from multiple (~100) clients to the server in terms of speed, security, etc.?
Thanks.
P.S.
For more detail, consider an application in which many (about 100) clients have to send streaming data (1MB of data per minute) to a central server over the internet. The data are needed on the server for online monitoring, analysis, and some computation such as machine learning and data mining tasks.
My question is about the difference between db-to-db connections and streaming solutions such as Kafka for transferring data from the clients to the server.
Prologue
I'm going to try and break your question down into its parts in order to get a clearer understanding of your current requirements, and then build it back up again. This has taken a long time to write, so I'd really appreciate it if you do two things off the back of it:
Be sceptical - there's absolutely no substitute for testing things yourself. The internet is very useful as a guide but there's no guarantee that the help you receive (if this answer is even helpful!) is the best thing for your specific situation. It's impossible to completely describe your current situation in the space allotted and so any answer is, of necessity, going to be lacking somewhere.
Look again at how you explained yourself - this is a valid question that's been held back by a lack of clarity in your description of the system and what you're trying to achieve. Getting someone unfamiliar with your system to look over a complex question before posting it may help.
Problem definition
"sensory data stream from multiple clients to a central server"
You're sending data from multiple locations to a single persistence store.
"online monitoring"
You're going to be triggering further actions based off the raw data and potentially some aggregated data.
"analysis and some computation such as machine learning and data mining tasks"
You're going to be performing some aggregations on the clients' data, i.e. you require aggregations of all of the clients' data to be persisted (however temporarily) somewhere.
Further assumptions
Because you're talking about materialized views we can assume that all the clients persist data in a database, probably Oracle.
The data coming in from your clients is about the same topic.
You've got ~100 clients; at that number we can assume that:
the number of clients might change
you want to be able to add clients without increasing the number of methods of accessing data
You don't work for one of Google, Amazon, Facebook, Quantcast, Apple etc.
Architecture diagram
Here, I'm not making any comment on how it's actually going to work - it's the start of a discussion based on my lack of knowledge of your systems. The "raw data persistence" can be files, Kafka, a database, etc. This is a description of the components that are going to be required and a rough guess as to how they will have to connect.
Applying assumed architecture to materialized views
Materialized views are a persisted query. Therefore you have two choices:
Create a query that unions all 100 clients' data together. If you add or remove a client you must change the query. If a network issue occurs at any one of your clients then everything fails.
Write and maintain 100 materialized views. The Oracle database at your central location has 100 incoming connections.
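To illustrate the first choice: the view definition has to enumerate every client's database link, so adding or removing a client means regenerating and recreating it. A sketch (the link and table names are invented):

```python
def build_mv_ddl(client_ids):
    """Build the UNION ALL query a single materialized view over all client links would need."""
    selects = [
        "    SELECT client_id, sensor_id, reading_ts, value\n"
        f"    FROM readings@client_{cid:03d}"          # one database link per client
        for cid in client_ids
    ]
    body = "\n    UNION ALL\n".join(selects)
    return f"CREATE MATERIALIZED VIEW all_client_readings AS\n{body};"

# 100 clients -> 100 SELECTs; client 101 means editing and re-creating the view.
print(build_mv_ddl(range(1, 101)))
```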
As you can probably guess from the tradeoffs you'll have to make, I do not like materialized views as the sole solution. We should be trying to reduce the amount of repeated code and single points of failure.
You can still use materialized views though. If we take our diagram and remove all the duplicated arrows in your central location it implies two things.
There is a single service that accepts incoming data
There is a single service that puts all the incoming data into a single place
You could then use a single materialized view for your aggregation layer (if your raw data persistence isn't in Oracle you'll first have to put the data into Oracle).
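As a sketch of what that single accepting service could be, here's a deliberately boring HTTP endpoint that appends each client's batch to the raw persistence layer. Flask and an append-only file are stand-ins; any equivalent technology works:

```python
import json
import time

from flask import Flask, request

app = Flask(__name__)
RAW_LOG = "raw_readings.jsonl"   # stand-in for the raw data persistence layer

@app.route("/ingest/<client_id>", methods=["POST"])
def ingest(client_id):
    # One line per incoming batch; downstream jobs (aggregation, ML) read from here,
    # so clients never talk to the central database directly.
    record = {"client_id": client_id, "received_at": time.time(), "payload": request.get_json()}
    with open(RAW_LOG, "a") as f:
        f.write(json.dumps(record) + "\n")
    return {"status": "accepted"}, 201

if __name__ == "__main__":
    app.run(port=8080)
```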
Consequences of changes
Now that we've decided you have a single data pipeline, your decisions actually become harder. We've decoupled your clients from the central location and the aggregation layer from the raw data persistence. This means that the choices are now yours, but they're also considerably easier to change.
Reimagining architecture
Here we need to work out what technologies aren't going to change.
Oracle databases are expensive and you're pushing 140GB/day into yours (that's 50TB/year by the way, quite a bit). I don't know if you're actually storing all the raw data, but at those volumes it's less likely that you are - more likely you're only storing the aggregations
I'm assuming you've got some preferred technologies where your machine learning and data mining happen. If you don't then consider getting some to prevent madness supporting everything
Putting all of this together we end up with the following. There's actually only one question that matters:
How many times do you want to read your raw data off your database?
If the answer to that is once then we've just described middleware of some description. If the answer is more than once then I would reconsider unless you've got some very good disks. Whether you use Kafka for this middle layer is completely up to you. Use whatever you're most familiar with and whatever you're most willing to invest the time into learning and supporting. The amount of data you're dealing with is non-trivial and there's going to be some trial and error getting this right.
One final point about this: we've defined a data pipeline, a single method of data flowing through your system. In doing so, we've increased the flexibility of the system. Want to add more clients? No need to do anything. Want to change the technology behind part of the system? As long as the interface remains the same, there's no issue. Want to send data elsewhere? No problem; it's all in the raw data persistence layer.

How to implement Data Lineage on Hadoop?

We are implementing a few business flows in the financial area. The requirement (unfortunately, not very specific) from the regulator is to have data lineage for auditing purposes.
The flow contains 2 parts: synchronous and asynchronous. The synchronous part is a payment attempt containing a bunch of info about the point of sale, the customer and the goods. The asynchronous part is a batch process that feeds the credit assessment data model with a newly-calculated portion of variables on an hourly basis. The variables might include some aggregations like balances and links to historical transactions.
For calculating the asynchronous part we ingest the data from multiple relational DBs and store it in HDFS in a raw format (rows from tables in CSV format).
When storing the data on HDFS is done, a job based on Spring XD is triggered that calculates some aggregations and produces the data for the synchronous part.
We have relational data, raw data on HDFS and MapReduce jobs relying on POJOs that describe the relevant semantics and the transformations implemented in SpringXD.
So, the question is how to handle the auditing in the scenario described above?
We need at any point in time to be able to explain why a specific decision was made and also be able to explain how each and every variable used in the policy (synchronous or near-real-time flow) was calculated.
I looked at the existing Hadoop stack and it looks like currently no tool provides good, enterprise-ready auditing capabilities.
My thinking is to start with a custom implementation that includes:
A business glossary with all the business terms
Operational and technical metadata - logging transformation execution for each entry into a separate store (a rough sketch of such a record follows below).
Log changes to the business logic (using data from version control, where the business rules and transformations are kept).
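Concretely, the per-execution record I have in mind would look something like this; the field names are just a first guess:

```python
import json
import time
import uuid

def lineage_record(job_name, code_version, inputs, outputs, parameters):
    """One operational-metadata entry per transformation run, written to a separate store."""
    return {
        "run_id": str(uuid.uuid4()),
        "job_name": job_name,                 # e.g. the Spring XD stream/job name
        "code_version": code_version,         # commit hash of the business rules/transformations
        "executed_at": time.time(),
        "inputs": inputs,                     # HDFS paths / source tables read
        "outputs": outputs,                   # variables or datasets produced
        "parameters": parameters,             # thresholds, lookback windows, etc.
    }

# Illustrative only: record how an hourly "balance_90d" variable might be produced.
entry = lineage_record(
    job_name="credit-variables-hourly",
    code_version="a1b2c3d",
    inputs=["/raw/payments/2016-05-01/", "/raw/accounts/2016-05-01/"],
    outputs=["balance_90d"],
    parameters={"window_days": 90},
)
print(json.dumps(entry, indent=2))
```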
Any advice or sharing your experience would be greatly appreciated!
Currently Cloudera sets the industry standard for Data Lineage/Data Governance in the big data space.
Glossary, metadata and historically run (versions of) queries can all be facilitated.
I do realize some of this may not have been in place when you asked the question, but it certainly is now.
Disclaimer: I am an employee of Cloudera

Microservices - Maintaining Multiple Data stores, initial data load etc

On aspects of granularity of microservices, I have read about the 2-pizza rule, services that can be developed in 2 weeks, etc. When the case studies of Amazon, Netflix and Gilt are read, we hear about hundreds of services. While the service granularity does make sense, what is still not clear to me is the data stores of each of these microservices. Will there not be just too many data stores if each of the services stores/maintains its own data? It might be the same logical entity, like a product or customer, that is sliced and the relevant portion/attributes stored/maintained by a corresponding microservice. There could be a service that maintains basic customer information, and another that maintains additional customer information like, say, subscription information or interests.
A couple of questions come to mind around the data stores:
Will this not be a huge maintenance issue in terms of backups, restores etc?
How is the initial data populated into these stores? Are there any best practices around this? Organisations are bound to have huge volumes of customer or product data and they will most likely be mastered in other systems.
How does this approach of multiple data stores impact the 'omni-channel' approach, where it implies getting a single view of all data? Organizations might have had data consolidation initiatives going on to achieve the same.
1. Will this not be a huge maintenance issue in terms of backups, restores etc?
From your view, yes it will. I mean, at the end of the day you will not have just one database server to back up but tens or hundreds of them. But mostly people - at least that is what we do - use a cloud database service to get rid of all this maintenance effort.
2. How is the initial data populated into these stores? Are there any best practices around this? Organisations are bound to have huge volumes of customer or product data and they will most likely be mastered in other systems.
I am not sure if there is a best way, but we created a client to read the data from the legacy system, then convert and split it into the parts for each microservice, and push them to those microservices by consuming their services. We used message queues to be sure about the health of the migration.
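In rough outline, our migration client did something along these lines (heavily simplified; the endpoints and schema are made up, and the message queue we put in front of the HTTP calls is omitted here):

```python
import sqlite3

import requests

# Where each slice of the legacy "customer" row should end up (URLs are made up).
SERVICE_ENDPOINTS = {
    "customer-core":          "http://customer-core/api/customers",
    "customer-subscriptions": "http://customer-subscriptions/api/subscriptions",
}

def migrate_customers(legacy_db_path):
    """Read legacy rows, split them per service, and push each part through that service's API."""
    legacy = sqlite3.connect(legacy_db_path)
    rows = legacy.execute("SELECT id, name, email, plan, renewal_date FROM customers")
    for cust_id, name, email, plan, renewal_date in rows:
        # Each microservice receives only the attributes it owns; they share just the entity id.
        requests.post(SERVICE_ENDPOINTS["customer-core"],
                      json={"id": cust_id, "name": name, "email": email}, timeout=10)
        requests.post(SERVICE_ENDPOINTS["customer-subscriptions"],
                      json={"customer_id": cust_id, "plan": plan, "renewal_date": renewal_date},
                      timeout=10)
```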
3. How does this approach of multiple data stores impact the 'omni-channel' approach, where it implies getting a single view of all data? Organizations might have had data consolidation initiatives going on to achieve the same.
Well, I don't know what "omni-channel" is, so I can't answer that.
Lastly, you mentioned logical entities shared between services. The hardest part about implementing microservices is defining what each service will provide. While doing that, you should carefully examine the data needs of each service, and those services should share as little as possible - only entity IDs, for example. At least that is what we are doing.
