I have created two tables in Lake Formation. For the users, I would like to create a view by joining these two tables. Is that possible in Lake Formation?
You'll have to create the view in your query engine, for example Amazon Athena (SQL). Note that such a view is not usable in another engine like Spark.
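For illustration, a view like the following could be created in the Athena query editor; the database, table, and column names (sales_db, orders, customers) are made up for this sketch:

-- Hypothetical Athena view joining two Lake Formation tables
CREATE OR REPLACE VIEW sales_db.orders_with_customers AS
SELECT
    o.order_id,
    o.order_date,
    o.amount,
    c.customer_id,
    c.customer_name
FROM sales_db.orders o
JOIN sales_db.customers c
    ON o.customer_id = c.customer_id;

The view is stored as metadata in the Glue Data Catalog, but its definition is Athena-specific, which is why engines like Spark can't query it.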
I have Iceberg tables in the AWS Glue Data Catalog which I want to use to create dashboards in AWS QuickSight. The idea is to set a date parameter in QuickSight and then be able to use it with the Iceberg time travel feature, i.e. I'd like QuickSight to filter the data as of a specific date using Iceberg's ability to execute queries "as of timestamp" (e.g. select * from table FOR TIMESTAMP AS OF (timestamp '2022-12-01 22:00:00')).
My questions are:
Does QuickSight support Iceberg tables as a data source?
Is it possible to use the time travel feature of Iceberg tables in QuickSight when writing custom SQL queries for the data source?
Is it possible to use a QuickSight parameter with Iceberg time travel?
If this is possible, it would be an extremely powerful combination: Iceberg time travel + QuickSight dashboards. If it is not possible, what are the best alternatives, assuming that my data is in Iceberg tables?
It seems that QuickSight can work with Iceberg tables as a data source, but I can't figure out whether QuickSight parameters can somehow be used in time travel for Iceberg tables.
You can use Iceberg tables as a data source with the Athena connector, but you can't use QuickSight parameters for time travel. QuickSight parameters don't interact with the datasets; they only apply to the following (see the sketch after the links below):
Calculated fields (except for multivalue parameters)
Filters
Dashboard and analysis URLs
Actions
Titles and descriptions throughout an analysis
https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg.html
https://docs.aws.amazon.com/quicksight/latest/user/parameters-in-quicksight.html
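If a fixed as-of timestamp is enough (rather than a user-driven parameter), the time travel clause can still be written directly into the custom SQL of an Athena-backed dataset; a minimal sketch, assuming a hypothetical Iceberg table my_db.sessions:

-- Hypothetical custom SQL for a QuickSight dataset on the Athena connector;
-- the timestamp is hard-coded because QuickSight parameters can't be injected here
SELECT *
FROM my_db.sessions
FOR TIMESTAMP AS OF TIMESTAMP '2022-12-01 22:00:00 UTC'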
I started my first data analysis job a few months ago, and I am in charge of a SQL database and of taking that data and creating dashboards in Power BI. Our SQL database is replicated from an online web portal we use for data entry. We do not add data to the database ourselves; instead, data is written to tables based on what is entered into the web portal. Since this database is replicated by another company, I created our own database that is connected to it via a linked server. I have built many views to pull only the needed data from the initial database (I did this to limit the amount of data sent to Power BI, for performance). My view count is climbing, and I am wondering whether, in terms of performance, this is the best way forward. The highest row count of a view is 32,000 and the lowest is around 1,000 rows.
Some of the views that I am writing end up joining 5-6 tables together due to the structure built by the data web portal company that controls the database.
My suggestion would be to create a data warehouse schema (star schema), keeping as a principle one star schema per domain: for example, one for sales, one for subscriptions, one for purchases, etc. Use the logic of data marts.
Identify your dimensions and your facts and keep evolving that schema. You will find that you end up with far fewer tables (a sketch of the idea follows below).
Your data is not that big, so you can use whatever ETL strategy you like: truncate-and-load or incremental.
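As a rough sketch of the star-schema idea for a single domain (all table and column names here are invented for illustration):

-- Hypothetical "sales" star schema: one fact table surrounded by dimension tables
CREATE TABLE dim_customer (
    customer_key  INT PRIMARY KEY,
    customer_name VARCHAR(200),
    region        VARCHAR(100)
);

CREATE TABLE dim_date (
    date_key       INT PRIMARY KEY,   -- e.g. 20240131
    full_date      DATE,
    calendar_year  INT,
    calendar_month INT
);

CREATE TABLE fact_sales (
    sales_key    BIGINT PRIMARY KEY,
    customer_key INT REFERENCES dim_customer (customer_key),
    date_key     INT REFERENCES dim_date (date_key),
    quantity     INT,
    amount       DECIMAL(18, 2)
);

Power BI then only needs to relate each fact table to a handful of conformed dimensions, instead of importing views that each join 5-6 operational tables.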
I am evaluating Snowflake for a reporting use case and am considering Snowpipe for ETL. Data is ingested from S3; the data in S3 contains information about user sessions captured at regular intervals. In Snowflake, I want to store this data in aggregated form. Per the documentation, Snowflake supports only basic transformations and doesn't support GROUP BY and JOIN while copying data from the S3 stage into Snowflake tables.
I am new to ETL and Snowflake. One approach I was considering is to load the raw, detailed data from the stage into a temporary table in Snowflake, and then run aggregations (GROUP BY and JOIN) on the temporary table to load data into the final fact tables. Is this the correct approach for implementing complex transformations?
Temporary tables in Snowflake only stick around for the session in which they were created, which means you won't be able to point a Snowpipe at one.
Instead of a temporary table, point Snowpipe at a transient table to store the raw data, and then truncate the table after some period of time. This will reduce costs. Personally, I'd keep the data in the transient table for as long as possible, provided that it is not too cost prohibitive, to account for potentially late-arriving data.
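A minimal sketch of that landing pattern, with the stage, pipe, table, and JSON field names all assumed for illustration:

-- Hypothetical transient landing table that Snowpipe loads raw JSON into
CREATE TRANSIENT TABLE raw_sessions (
    payload   VARIANT,
    loaded_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);

CREATE PIPE sessions_pipe AUTO_INGEST = TRUE AS
    COPY INTO raw_sessions (payload)
    FROM (SELECT $1 FROM @my_s3_stage)
    FILE_FORMAT = (TYPE = 'JSON');

-- Periodically clear out raw rows that have already been aggregated, per the advice above
-- DELETE FROM raw_sessions WHERE loaded_at < DATEADD(day, -30, CURRENT_TIMESTAMP());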
Yes, your approach looks good to me.
Snowpipe loads your data continuously from S3 into Snowflake, and within Snowflake you use
Views
Tables and Stored Procedures
to transform the data and load it into your final fact table.
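A minimal sketch of that second step, reusing the hypothetical raw_sessions table above and an invented fact table layout:

-- Hypothetical aggregated fact table
CREATE TABLE IF NOT EXISTS fact_user_sessions (
    user_id            STRING,
    session_date       DATE,
    session_count      NUMBER,
    total_duration_sec NUMBER
);

-- The GROUP BY (and any JOINs) happen here, after the raw load
INSERT INTO fact_user_sessions
SELECT
    payload:user_id::STRING           AS user_id,
    payload:captured_at::DATE         AS session_date,
    COUNT(*)                          AS session_count,
    SUM(payload:duration_sec::NUMBER) AS total_duration_sec
FROM raw_sessions
GROUP BY 1, 2;

In practice a statement like this would live inside a view, a stored procedure, or a scheduled task that runs after Snowpipe has delivered new files.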
While playing with AWS Amplify, I was looking for the most appropriate architecture pattern, which would allow me to have the following:
scalable and reliable DB to handle CRUD operations (DynamoDB rocks here)
complex querying and filtering, where data access patterns are not strictly defined or are unknown (Elasticsearch wins here)
So obviously I am hooked by the idea of streaming DynamoDB data to Elasticsearch for queries and keeping DynamoDB with read/write operations only.
What are the pros and cons of this architecture?
[UPDATE] Here is an example of a use case:
Case Management application: 2-5 core tables with information about cases and tasks, 30-50 fields per table, up to 1 million records each.
Users need to be able to run complex queries against ALL fields and tables, so it's hard to pre-define specific access patterns.
The idea is to stream ALL data from DynamoDB to Elasticsearch, so ALL of the QUERIES go to Elasticsearch.
All of the read/write operations would go to DynamoDB as the primary source of data.
I want to be able to store an arbitrary C# struct with some variables on the Azure SQL server and retrieve it later into a similar struct. How can I do this without knowing the structure of the database?
SQL Azure is very similar to SQL Server, as you would build your schema, tables, rows, etc. the same way. If you wanted a schemaless approach to data types, you'd need to serialize your objects to some generic column, along with a Type column. Or use a Property table approach.
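A minimal sketch of the generic-column idea in SQL Azure (all names are hypothetical; the application serializes the struct, e.g. to JSON or XML, before inserting):

-- Hypothetical table for storing serialized objects of arbitrary shape
CREATE TABLE SerializedObjects (
    Id             UNIQUEIDENTIFIER NOT NULL PRIMARY KEY DEFAULT NEWID(),
    TypeName       NVARCHAR(512)    NOT NULL,  -- fully qualified CLR type name
    SerializedData NVARCHAR(MAX)    NOT NULL,  -- JSON/XML produced by the application
    CreatedAt      DATETIME2        NOT NULL DEFAULT SYSUTCDATETIME()
);

-- On retrieval, the application inspects TypeName to decide how to deserialize SerializedData
SELECT TypeName, SerializedData
FROM SerializedObjects
WHERE Id = @ObjectId;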
Alternatively, Windows Azure has a schema-free storage construct, the Windows Azure Table. Each row may contain different data. You'd just need some mechanism for determining the type of data you wrote (maybe one of the row properties, perhaps). Azure Tables are lightweight compared to SQL Azure, in that it's not a relational database. Each row is referenced by a Partition Key and Row Key (the pair being essentially a composite key).
So... assuming you don't have complex search / index requirements, you should be able to use Azure Tables to accomplish what you're trying to do.
This blog post goes over the basics of both SQL Azure and Azure Tables.
There are also examples of using Azure Tables in the Platform Training Kit.