AWS Glue - disabling bookmarks for some of the sources in a job - etl

I've got a data warehouse with a star schema: a fact table and multiple dimension tables around it, connected by foreign keys.
I've got two AWS Glue jobs:
Populates the dimensions (run on-demand; they don't change often)
Populates the fact table (should run as often as every hour to get fresh fact data into the warehouse)
So the situation is: the first job has filled in the dimension tables. In the second job I need to get only the fresh data for the fact table, look up the foreign key for each record in the dimension tables, and persist the new rows in the fact table.
The problem is that when using bookmarks, AWS Glue sees that the dimension tables haven't changed since the last run, reads nothing from them, and inserts null for the foreign keys.
I tried to disable bookmarking by removing the transformation_ctx parameter from the generated script, but it didn't work.
From this:
dimension_node1647201451763 = glueContext.create_dynamic_frame.from_catalog(
    database="foobar-staging",
    table_name="dimension",
    transformation_ctx="dimension_node1647201451763",
)
I changed it to this:
foobaritem_node1647201451763 = glueContext.create_dynamic_frame.from_catalog(
    database="foobar-staging",
    table_name="foobar_item",
)
But those records were still not found.
The only solution I can imagine is disabling bookmarks completely and then adding "not exists" checks for all processed records, which would prevent duplicates.
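In SQL terms, the "not exists" check I have in mind would be something like the sketch below (all table and column names are placeholders, not my real schema):

-- Load only fact rows that are not already in the fact table.
-- fact_sales, staging_sales and dim_item are illustrative names.
INSERT INTO fact_sales (sale_id, item_fk, amount)
SELECT s.sale_id,
       d.item_id,            -- foreign key resolved from the dimension
       s.amount
FROM   staging_sales s
JOIN   dim_item d ON d.natural_key = s.item_natural_key
WHERE  NOT EXISTS (          -- skip rows already loaded by a previous run
           SELECT 1
           FROM   fact_sales f
           WHERE  f.sale_id = s.sale_id
       );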

Related

CDC strategy for multiple staging tables

I'm implementing a Data Mart following the Kimball methodology and I have a challenge with applying deltas from multiple source tables against a single target dimension.
Here's an example of the incoming source data:
STG_APPLICATION
APP_ID, APP_NAME, APP_START_DATE, CDC_HASH, ...
1, FOOBAR, 20/10/2018, MD5_XXX
STG_APPLICATION_STATUS
APP_ID, STATUS_CODE, STATUS_DESC, CDC_HASH, ...
1, SUBMITTED, "APP WAS SUBMITTED", MD5_YYY
Each of these tables (there are several others) represents a normalised version of the source data, i.e. a single application can have one or more statuses associated with it.
Now then, because we only get a full extract for these tables, we have to do a snapshot merge, i.e. apply a full outer join of the current day's set of records against the previous day's set for each individual table. Changes are detected by comparing the CDC_HASH (a hash of the concatenation of all source columns). The result of this comparison is stored in a delta table as follows:
STG_APPLICATION_DELTA
APP_ID, APP_NAME, APP_START_DATE, CDC_HASH, CDC_STATUS ...
STG_APPLICATION_STATUS_DELTA
APP_ID, STATUS_CODE, STATUS_DESC, CDC_HASH, CDC_STATUS...
1, AWARDED, "APP WAS AWARDED", MD5_ZZZ, NEW
So in this example the first table, STG_APPLICATION, did not generate a delta record, as the attributes pertaining to that table did not change between daily loads. However, the associated table, STG_APPLICATION_STATUS, did generate a delta, i.e. one or more fields have changed since the last load. This is flagged by the CDC_STATUS, which identifies it as a new record to insert.
The problem now, of course, is how to correctly handle this situation when loading the target dimension. For example:
DIM_APPLICATION
ID, APPLICATION_ID, APP_NAME, APP_START_DATE, APP_STATUS_CODE, FROM_DATE, TO_DATE
1, 1, FOOBAR, 20/10/2018, SUBMITTED, 20/10/2018, 12/04/2019
2, 1, NULL, NULL, AWARDED, 13/04/2019, 99/99/9999
This shows the first record, based on these two staging tables being joined, and the second record, which is meant to reflect an updated version of it. However, as previously illustrated, my delta tables are only partially populated, and therefore I am unable to correctly update the dimension as shown here.
Logically, I understand that I need to include all fields used by the dimension as part of my delta calculation, so that I have a copy of a full record when updating the dimension, but I'm not sure of the best way to implement this in my staging area. As shown already, I currently only have independent staging tables, each of which calculates its delta separately.
Can somebody please advise on the best way to handle this? I've scrutinized Kimball's books on this but to no avail, and I've equally found no suitable answer on any online forums. This is a common problem, so I'm sure there is a suitable architectural pattern to resolve it.
You will need to either compare on joined records or look up the current dimension values.
If the amount of (unchanged) data is not excessive, you could join the full snapshots of STG_APPLICATION and STG_APPLICATION_STATUS together on APP_ID until they resemble the dimension record column-wise, and store those in a separate table with their CDC hash to use as the previous day. You then take the deltas at this level and send the (complete) changed records as updates to the dimension.
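A sketch of that first option in SQL (STG_APPLICATION_COMBINED and its _PREV copy are invented names, and MD5() stands in for whatever hash function your platform provides):

-- Build today's combined snapshot at dimension grain.
INSERT INTO STG_APPLICATION_COMBINED
SELECT a.APP_ID,
       a.APP_NAME,
       a.APP_START_DATE,
       s.STATUS_CODE,
       MD5(a.APP_NAME || '|' || a.APP_START_DATE || '|' || s.STATUS_CODE) AS CDC_HASH
FROM   STG_APPLICATION a
JOIN   STG_APPLICATION_STATUS s ON s.APP_ID = a.APP_ID;

-- The delta is then any combined row that is new or whose hash changed.
SELECT t.*
FROM   STG_APPLICATION_COMBINED t
LEFT JOIN STG_APPLICATION_COMBINED_PREV p ON p.APP_ID = t.APP_ID
WHERE  p.APP_ID IS NULL
   OR  p.CDC_HASH <> t.CDC_HASH;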
If the number of records in the daily update makes it impractical to join the full tables, you can take the deltas and full outer join them as you do now. Then you look up the current dimension record for that APP_ID and fill in all empty fields in the delta record. The completed record is then sent as an update to the dimension.
This solution requires less storage but seems more fragile, especially if multiple changes are possible within a day. If there are many changes, performance may also suffer. For a handful of changes in millions of records, it should be more efficient.
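The second option's lookup step might look like the following sketch, where DELTA_COMBINED stands for the full outer join of the individual delta tables and '99/99/9999' marks the current dimension row, as in the example above:

-- Fill the gaps in a partial delta record from the current dimension row.
SELECT d.APP_ID,
       COALESCE(d.APP_NAME,       cur.APP_NAME)        AS APP_NAME,
       COALESCE(d.APP_START_DATE, cur.APP_START_DATE)  AS APP_START_DATE,
       COALESCE(d.STATUS_CODE,    cur.APP_STATUS_CODE) AS APP_STATUS_CODE
FROM   DELTA_COMBINED d
LEFT JOIN DIM_APPLICATION cur
       ON  cur.APPLICATION_ID = d.APP_ID
       AND cur.TO_DATE = '99/99/9999';  -- current version of the row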

Oracle - build dimension from a file based data source

I'm trying to build a star schema in Oracle 12c. In my case the data source is not a relational database but a single Excel/CSV file populated via a Google Form, which means I don't have any sort of reference from a source system, such as auto-incremented keys/IDs. What would be the best approach to building a star schema under this condition?
File row sample:
<submitted timestamp>,<submitted by user>,<region>,<country>,<branch>,<branch location>,<branch area>,<branch type>,<branch name>,<branch private? yes/no value>,<the following would be all "fact" values (measurements)>,...,...,...
In case I wanted to build a "branch" dimension, how would I handle updates/inserts after the first load into the dimension table?
My solution so far:
I had thought of making a concatenated string "key" from the branch values, which would make it unique (an underscore would be the "glue" joining the values), e.g.:
<region>_<country>_<branch>_<branch location> as branch_key
I would insert all the distinct branches into a staging table, including the branch_key column for each of them. Then, when loading into the dimension, I could check which keys do not yet exist in my dimension table and insert them. As for updates, I'm a bit stuck on how to handle them; I had thought of having another file mapping which branches are active, with an expiration date column. Basically I'm trying to simulate what I could do if the data were in a database instead of CSV files.
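In Oracle the insert-if-missing part could be a single MERGE; this is just a sketch of what I have in mind, with placeholder table and column names:

-- Upsert new branches into the dimension via the concatenated business key.
-- dim_branch and stg_branch are illustrative names.
MERGE INTO dim_branch d
USING (
    SELECT DISTINCT
           region || '_' || country || '_' || branch || '_' || branch_location AS branch_key,
           region, country, branch, branch_location
    FROM   stg_branch
) s
ON (d.branch_key = s.branch_key)
WHEN NOT MATCHED THEN
    INSERT (branch_key, region, country, branch, branch_location)
    VALUES (s.branch_key, s.region, s.country, s.branch, s.branch_location);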
This is all I can think of so far. Do you have any other recommendations/ideas on how to implement this? Take into consideration that the data source cannot change, as in I have to read these CSV files, since the data is not stored anywhere else.
Thank you.

Filtering Quotes by InventTable

I'm trying to build a report in AX 2009 (SP1, currently rollup 6) with a primary data source of the SalesQuotationLine table. Due to how our inventory is structured, I need to apply a filter that shows only certain categories of items (in this case, non-service items as defined in the InventTable). However, it seems that there is a problem in the link between the SalesQuotationLine and InventTable such that only two specific items will ever display.
We have tested this against the Sales Quotation Details screen as well, with the same results. Executing a query such as this:
...will only show quotes that have one of the specific items mentioned earlier. If we change the Item Type to something else (for example, to Item), the result is an empty set. We are also getting this issue on one of our secondary test servers, which is for all intents and purposes a fresh install.
There doesn't seem to be any issues with the data mapping from the one table to the other, and we are not experiencing this issue with any other table set. Is this a real issue, or am I just missing something?
After analyzing the results of a SQL Profiler run during the execution of the query, it seems the issue was a system bug. When selecting a table to join to the SalesQuotationLines, you have two options: 'Items' and 'Items (Item Number)'. Regardless of which table you select, the query joins the InventTable with the relation "SalesQuotationLines.ProjTransCode = InventTable.ItemId".
After comparing the table to other layers in the system, I found that the following block of code had been removed from the createLine method (in the SYP layer):
if (this.ProjTransType == QuotationProjTransType::Item)
{
    this.ProjTransCode = this.ItemId;
}
Since the ProjTransCode is no longer being populated, the join does not work except on certain quote lines that do have the ProjTransCode populated.
In addition, there is no directly defined relation to the InventTable; the link is only maintained via an Extended Data Type used on the SalesQuotationLine.ItemId field. Adding this relation manually solved the problem.

Teradata: How to design table to be normalized with many foreign key columns?

I am designing a table in Teradata with about 30 columns. These columns need to store several time-interval-style values such as Daily, Monthly, Weekly, etc. It would be bad design to store the actual string values in the table, since this would be an atrocious repetition of data. Instead, what I want to do is create a primitive lookup table. This table would hold Daily, Monthly, Weekly and would use Teradata's identity column to derive the primary key. This primary key would then be stored in the table I am creating as foreign keys.
This would work fine for my application since all I need to know is the primitive key value as I populate my web form's dropdown lists. However, other applications we use will need to either run reports or receive this data through feeds. Therefore, a view will need to be created that joins this table out to the primitives table so that it can actually return Daily, Monthly, and Weekly.
My concern is performance. I've never created a table with such a large number of foreign key fields and am fairly new to Teradata. Before I go down the long road of figuring this all out the hard way, I'd like any advice I can get on the best way to achieve my goal.
Edit: I suppose I should add that this lookup table would be a mishmash of unrelated primitives. It would contain groups of values relating to time intervals, as already mentioned above, but also time frames such as 24x7 and 8x5. The table would be designed like this:
ID   Type        Value
---  ----------  ----------
1    Interval    Daily
2    Interval    Monthly
3    Interval    Weekly
4    TimeFrame   24x7
5    TimeFrame   8x5
What you've done should be fine. Obviously, you'll need to run the actual queries and collect statistics where appropriate.
One thing I can recommend is to have an additional row in the lookup table like so:
ID   Type        Value
---  ----------  ----------
0    Unknown     Unknown
Then in the main table, instead of having fields as null, you would give them a value of 0. This allows you to use inner joins instead of outer joins, which will help with performance.
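To illustrate (my_table, lookup_primitives, and the foreign key columns are placeholder names), the reporting view can then use plain inner joins and still return 'Unknown' where a code was never set:

-- With unknown members keyed as 0, plain inner joins suffice.
REPLACE VIEW v_my_table AS
SELECT t.row_id,
       i.Value AS interval_value,    -- 'Daily', 'Monthly', ... or 'Unknown'
       f.Value AS timeframe_value    -- '24x7', '8x5', ... or 'Unknown'
FROM   my_table t
JOIN   lookup_primitives i ON i.ID = t.interval_id
JOIN   lookup_primitives f ON f.ID = t.timeframe_id;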

Can you sort a GET on a Cassandra column family by the Timestamp value created for each column entry, rather than the column Keys?

Basically I have a 'thread line' where new threads are made and a TimeUUID is used as a key, which obviously makes sorting new threads quite easy, especially when, say, querying for the latest 20 threads.
My problem is that when a new 'post' is made to a thread, I want to be able to 'bump' that thread to the front of the 'thread line'. How do I make this happen so that my queries still return threads in the right order without producing any kind of duplicates?
The only way I can see this working is if, rather than the column family sorting by TimeUUID, it sorted by the insertion timestamp; then I could use the unique thread IDs as column keys and retrieve them in the order they were inserted or reinserted, rather than by TimeUUID. Is this possible, or am I missing a simple trick that allows for this? As far as I know you have to set a particular comparator, or otherwise it defaults to bytes?
Columns within a row are always sorted by name with the given comparator. You cannot sort by timestamp or value or anything else, or Cassandra would not be able to merge multiple updates to the same column correctly.
As to your use case, I can think of two options.
The option most similar to what you are doing now would be to create a second column family, ThreadMostRecentPosts, with TimeUUID columns (you said "keys" but it sounds like you mean "columns"). When a new post arrives, delete the old most-recent column and add a new one.
This has two problems:
The unit of replication is the row, so having this grow indefinitely could be problematic. (Using expiring columns to age out no-longer-relevant thread information might help.)
You need a lock manager so that multiple posts to the same thread don't race and possibly leave multiple entries in this row.
I would suggest instead creating a row per day (for instance), whose columns are the thread IDs and whose values are the most recent post. Adding a new post just updates the value in that column; no delete/re-add is done, so the race is not a problem anymore. You don't get sorting for free anymore but that's okay because you're limiting it to a small enough set that you can do that sort in memory (say, yesterday's threads and today's).
(Finally, I would add that I can say from experience that having a cutoff past which old threads don't get bumped to the front by a new reply is a Good Thing.)
