Using Iceberg tables and the time travel feature in AWS QuickSight

I have Iceberg tables in the AWS Glue Data Catalog that I want to use to create dashboards in AWS QuickSight. The idea is to set a date parameter in QuickSight and then be able to use it with the Iceberg time travel feature, i.e. I'd like QuickSight to filter the data as of a specific date using Iceberg's ability to execute queries "as of timestamp" (e.g. SELECT * FROM table FOR TIMESTAMP AS OF TIMESTAMP '2022-12-01 22:00:00').
My questions are:
Does QuickSight support Iceberg tables as a data source?
Is it possible to use the time travel feature of Iceberg tables in QuickSight when writing custom SQL queries for the data source?
Is it possible to use a QuickSight parameter with Iceberg time travel?
If this is possible, the combination of Iceberg time travel and QuickSight dashboards would be extremely powerful. If it is not possible, what are the best alternatives, assuming that my data is in Iceberg tables?
It seems that QuickSight can work with Iceberg tables as a data source, but I can't figure out whether QuickSight parameters can somehow be used in time travel queries for Iceberg tables.

You can use Iceberg tables as a data source through the Athena connector, but you can't use QuickSight parameters for time travel. QuickSight parameters don't interact with the dataset query; they only apply to the following:
Calculated fields (except for multivalue parameters)
Filters
Dashboard and analysis URLs
Actions
Titles and descriptions throughout an analysis
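As a workaround, a custom SQL dataset can still pin the data to a fixed point in time by hard-coding the timestamp in the Athena query. A minimal sketch (the table name is illustrative, and the timestamp has to be edited in the dataset rather than driven by a parameter):

SELECT *
FROM my_iceberg_table
FOR TIMESTAMP AS OF TIMESTAMP '2022-12-01 22:00:00'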
https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg.html
https://docs.aws.amazon.com/quicksight/latest/user/parameters-in-quicksight.html

Related

Advice on Setup

I started my first data analysis job a few months ago. I am in charge of a SQL database, and I take that data and create dashboards in Power BI. Our SQL database is replicated from an online web portal we use for data entry. We do not add data to the database ourselves; instead, the data is put into tables based on what is entered into the web portal. Since this database is replicated by another company, I created our own database connected via a linked server. I have built many views to pull only the needed data from the initial database (I did this to limit the amount of data sent to Power BI, for performance). My view count is climbing, and I am wondering whether, in terms of performance, this is the best way forward. The highest row count of a view is 32,000 and the lowest is around 1,000 rows.
Some of the views that I am writing end up joining 5-6 tables together due to the structure built by the data web portal company that controls the database.
My suggestion would be to create a data warehouse schema (star schema), keeping as a principle one star schema per domain: for example, one for sales, one for subscriptions, one for purchases, etc. Use the logic of data marts.
Identify your dimensions and your facts and keep evolving that schema. You will find that you end up with far fewer tables.
Your data is not that big, so you can use whatever ETL strategy you like: truncate-and-load or incremental.
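As an illustration, a sales-domain star schema might start with one dimension and one fact table along these lines (all names are hypothetical):

CREATE TABLE DimProduct (
    ProductKey   INT PRIMARY KEY,
    ProductName  VARCHAR(100),
    Category     VARCHAR(50)
);

CREATE TABLE FactSales (
    SalesKey    INT PRIMARY KEY,
    ProductKey  INT REFERENCES DimProduct(ProductKey), -- surrogate key into the dimension
    OrderDate   DATE,
    Quantity    INT,
    Amount      DECIMAL(18,2)
);

The views that currently join 5-6 source tables then become the ETL that loads these tables, so the reports query a few well-shaped tables instead.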

Filter a Data Source from a Different Data Source

I have two table charts, each with a different data source. I want one table to act as a filter for the other table.
Here is the problem...
I tried a custom query for my data source which used the email parameter to filter the data source.
The problem is every time a user changes a filter on any page a query is executed in BigQuery, slowing the results and exponentially increasing my BigQuery monthly charges.
I tried blending the two tables.
The problem is the blended data feature only allows for 10 dimensions to be added to the resulting blended data source and is very slow.
I tried creating a control filter using a custom field on the "location" column on each table sharing the same "Field Id".
The problem is that the results table returns all the stores until you click on a location in the control list. And I cannot let a user see other locations.
Here is a link to a Data Studio sample report where you can clearly see what I am trying to do.
https://datastudio.google.com/reporting/dd33be45-ab13-4881-8a3b-cabafa8c0dbb
Thanks
One solution I can recommend to overcome your first challenge (high cost): you can control cost by using GCP Memorystore as a cache, depending on how frequently the data is updated.
Moreover, BigQuery also caches results for a query as long as you are not using wildcards on tables or time-partitioned tables, so try to optimize your solution for analysis cost if that is feasible. BigQuery partitioning and clustering may also help you reduce BQ analysis cost.
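On the partitioning and clustering point, a sketch of how the table could be declared so that filtered queries scan less data (the dataset, table, and column names are assumptions based on the report):

CREATE TABLE mydataset.store_sales (
  location   STRING,
  user_email STRING,
  sale_date  DATE,
  amount     NUMERIC
)
PARTITION BY sale_date  -- date filters prune to matching partitions
CLUSTER BY location;    -- rows co-located by store, cutting bytes billed for location filters

Queries that filter on sale_date and location then bill only for the blocks they touch, which directly addresses the per-interaction cost problem.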

Which NiFi processor to use for RDBMS extract

I will explain my use case to help determine which DB extract utility to use.
I need to extract data from SQL Server tables with varying frequency each day. Each extract query is a complex SQL statement involving 5-10 tables in joins, with multiple clauses. I have around 20-30 such statements overall.
All of these extract queries might be required to run multiple times a day, with varying frequency each day, depending on how many times we receive data from the source system, among other cases.
We are planning to use Kafka to publish a message to let the NiFi workflow know whenever an RDBMS table is updated and the flow needs to be triggered (I can't just trigger the NiFi flow based on an "incremental" column value; there might be all-row-update scenarios, and we might not create new rows in the tables).
How should I go about designing my NiFi flow? There are ExecuteSQL, GenerateTableFetch, ExecuteSQLRecord, and QueryDatabaseTable processors available. Which one is going to fit my requirement best?
Thanks!
I suggest that you use ExecuteSQL. You can set the query from an attribute or compose it using attributes. The easiest way is to create JSON, then parse that JSON and create attributes. Check this example (link), where the SQL is created from a file; you can adjust it to build the query from the Kafka message instead.
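A sketch of how the flow could be wired: ConsumeKafka reads the trigger message, EvaluateJsonPath promotes fields from the JSON into flowfile attributes, and ExecuteSQL's query property references those attributes via NiFi Expression Language (the attribute, table, and column names below are assumptions):

SELECT o.order_id, o.status, c.customer_name
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
WHERE o.region = '${source.region}'       -- attribute extracted from the Kafka JSON
  AND o.updated_at >= '${last.update.ts}' -- timestamp carried in the trigger message

Since the trigger is event-driven rather than based on an incrementing column, ExecuteSQL fits better than QueryDatabaseTable or GenerateTableFetch, which are designed to track maximum-value columns themselves.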

Which is more efficient for HBase performance: multiple tables with the same structure, or a single table containing a large set of data?

I had earlier created a project that stored daily data for a particular entity in an RDBMS by creating a single table for each day and then storing that day's data in it.
But now I want to move my database from the RDBMS to HBase. So my question is whether I should create a single table and store the data for all days in it, or use my earlier approach of creating an individual table for each day. I want to compare both cases on the basis of HBase performance.
Sorry if this question seems foolish to you. Thank you.
As you mentioned, there are two options:
Option 1: a single table with all days' data
Option 2: multiple tables
I would prefer namespaces (introduced in version 0.96; a very important feature) with option 2 if you have a huge amount of data for a single day. This will also support multi-tenancy requirements...
See the HBase Book:
A namespace is a logical grouping of tables analogous to a database in relational database systems. This abstraction lays the groundwork for upcoming multi-tenancy related features:
Quota Management (HBASE-8410) - Restrict the amount of resources (i.e. regions, tables) a namespace can consume.
Namespace Security Administration (HBASE-9206) - Provide another level of security administration for tenants.
Region server groups (HBASE-6721) - A namespace/table can be pinned onto a subset of RegionServers, thus guaranteeing a coarse level of isolation.
Below are the shell commands related to namespaces:
alter_namespace, create_namespace, describe_namespace, drop_namespace, list_namespace, list_namespace_tables
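For example, creating per-day tables under one namespace might look like this in the HBase shell (the namespace, table, and column family names are illustrative):

create_namespace 'daily_data'
create 'daily_data:events_20221201', 'cf'  # one table per day, grouped under the namespace
list_namespace_tables 'daily_data'         # enumerate all per-day tables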
Advantages:
Even if you use column filters, since each table holds less data (one day's worth), data retrieval will be fast for a full table scan compared to the single-table approach (a full scan on a big table is costly).
If you want authentication and authorization on a specific table, that can also be achieved.
Limitation: you will end up with multiple scripts to manage the tables, rather than a single script (option 1).
Note: with either of the aforementioned options, your rowkey design is very important for good performance and to prevent hotspotting.
For more details, look at the hbase-series.

Loading data with user conditions in Power BI

I am developing a report application in the Power BI desktop version. I successfully created a dataset using a query and applying filters to the result data. But now I have to get data from the database in real time with user filters, i.e. the dataset would be created on the basis of inputs given by users. We need this because the database is quite large, and we cannot load all the data first and then apply filters to create reports.
The same can easily be done in a .NET application, but we have to achieve this in Power BI.
Please suggest whether this can be done.
I would use the Query Parameters feature for this. You add them in the Edit Queries window, from Home / Manage Parameters; then you can use them in calculated columns or to replace a "hard-coded" filter.
There's a detailed write-up in a recent blog post:
https://powerbi.microsoft.com/de-de/blog/deep-dive-into-query-parameters-and-power-bi-templates/
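To illustrate the "hard-coded filter" case: suppose the source query currently pins a literal date (the table and column names are hypothetical):

SELECT CustomerID, OrderDate, Amount
FROM dbo.Orders
WHERE OrderDate >= '2017-01-01'  -- literal that a StartDate parameter would replace

Once a StartDate query parameter is defined, the literal is swapped for the parameter in the filter step, so only rows matching the user-supplied value are loaded instead of the whole table.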
