My idea is to use the database (through an ORM) with the same carefree approach you would take with simple objects in a garbage-collected environment.
The basic idea is to use cascade-remove on most of the connections between tables, and just skip the failing steps. Here is a simple example:
Country (id, name)
1, UK
2, Germany
City (id, name, country)
1, London, 1
2, Brighton, 1
3, Schweinfurt, 2
This way, when you remove City(3), the removal cascades to Country(2), which gets removed too (as it is no longer referenced).
If, on the other hand, you want to remove City(2), the removal of Country(1) will fail (it is still referenced by City(1)) and only the City entity itself should be removed.
The problem is that Doctrine rolls back the whole transaction, so neither the Country nor the City is deleted. Is there a way to change this behavior?
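To make the intent concrete, this is roughly the behaviour I am after at the SQL level (just an illustrative sketch of the desired outcome, not how Doctrine handles it; table and column names follow the example above):

CREATE TABLE Country (
    id   INT PRIMARY KEY,
    name VARCHAR(100)
);

CREATE TABLE City (
    id      INT PRIMARY KEY,
    name    VARCHAR(100),
    country INT REFERENCES Country(id)
);

-- Removing City(3): Country(2) is no longer referenced, so it can go too
DELETE FROM City WHERE id = 3;
DELETE FROM Country
WHERE id = 2
  AND NOT EXISTS (SELECT 1 FROM City WHERE country = 2);

-- Removing City(2): the same pattern leaves Country(1) in place,
-- because City(1) still references it, and nothing fails or rolls back
DELETE FROM City WHERE id = 2;
DELETE FROM Country
WHERE id = 1
  AND NOT EXISTS (SELECT 1 FROM City WHERE country = 1);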
I'm fairly new to Elasticsearch. I have been trying to come up with a way of merging data that belongs to the same entity but does not always have the same columns.
For example, let's say I'm getting cars from database1 and database2.
In database1 I have the plate, the weight, the color and, for example, the VIN number;
in database2 I have the color, the date of creation and the VIN number again;
and in database3 I might, for example, have the color, the date of creation and the plate.
If I were to add a car A from database1 and then the same car A from database3, the ID (plate) would collide; but if I did the same with database2, it wouldn't collide and I would end up with two different documents for the same car.
I'd like to know if there's a way to use "multiple optional IDs", being able to use one of them, the other, or both, depending on the data I add each time, as different databases have different fields.
Thanks.
I'm implementing a Data Mart following the Kimball methodology and I have a challenge with applying deltas from multiple source tables against a single target dimension.
Here's an example of the incoming source data:
STG_APPLICATION
APP_ID, APP_NAME, APP_START_DATE, CDC_HASH, ...
1, FOOBAR, 20/10/2018, MD5_XXX
STG_APPLICATION_STATUS
APP_ID, STATUS_CODE, STATUS_DESC, CDC_HASH, ...
1, SUBMITTED, "APP WAS SUBMITTED", MD5_YYY
Each of these tables (there are several others) represents a normalised version of the source data, i.e. a single application can have one or more statuses associated with it.
Now then, because we only get a full alpha for these tables, we have to do a snapshot merge, i.e. apply a full outer join of the current day's set of records against the previous day's set of records for each individual table. This is computed by comparing the CDC_HASH (a hash over the concatenation of all source columns). The result of this comparison is stored in a delta table as follows:
STG_APPLICATION_DELTA
APP_ID, APP_NAME, APP_START_DATE, CDC_HASH, CDC_STATUS ...
STG_APPLICATION_STATUS_DELTA
APP_ID, STATUS_CODE, STATUS_DESC, CDC_HASH, CDC_STATUS...
1, AWARDED, "APP WAS AWARDED", MD5_YYY, NEW
So in this example, the first table, STG_APPLICATION did not generate a delta record as the attributes pertaining to that table did not change between daily loads. However, the associated table, STG_APPLICATION_STATUS, did calculate a delta, i.e. one or more fields have changed since the last load. This is highlighted by the CDC_STATUS which identifies it as a new record to insert.
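In SQL terms, the per-table delta step looks roughly like this (a sketch; the _TODAY/_YESTERDAY table names, the join key and the exact CDC_STATUS values are simplifications):

INSERT INTO STG_APPLICATION_STATUS_DELTA
    (APP_ID, STATUS_CODE, STATUS_DESC, CDC_HASH, CDC_STATUS)
SELECT
    COALESCE(cur.APP_ID, prev.APP_ID),
    cur.STATUS_CODE,
    cur.STATUS_DESC,
    cur.CDC_HASH,
    CASE
        WHEN prev.APP_ID IS NULL
          OR cur.CDC_HASH <> prev.CDC_HASH THEN 'NEW'      -- new or changed record
        WHEN cur.APP_ID IS NULL            THEN 'DELETED'  -- assumed value for removed records
    END AS CDC_STATUS
FROM STG_APPLICATION_STATUS_TODAY cur
FULL OUTER JOIN STG_APPLICATION_STATUS_YESTERDAY prev
    ON prev.APP_ID = cur.APP_ID
WHERE prev.APP_ID IS NULL
   OR cur.APP_ID IS NULL
   OR cur.CDC_HASH <> prev.CDC_HASH;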
The problem now, of course, is how to correctly handle this situation when loading the target dimension. For example:
DIM_APPLICATION
ID, APPLICATION_ID, APP_NAME, APP_START_DATE, APP_STATUS_CODE, FROM_DATE, TO_DATE
1, 1, FOOBAR, 20/10/2018, SUBMITTED, 20/10/2018, 12/04/2019
2, 1, NULL, NULL, AWARDED, 13/04/2019, 99/99/9999
This shows the first record - based on these two staging tables being joined - and the second record which is meant to reflect an updated version of the record. However, as previously illustrated, my Delta tables are only partially populated, and therefore I am unable to correctly update the dimension as shown here.
Logically, I understand that I need to include all fields used by the dimension as part of my delta calculation, so that I have a copy of the full record when updating the dimension, but I'm not sure of the best way to implement this in my staging area. As shown already, I currently only have independent staging tables, each of which calculates its delta separately.
Please can somebody advise on the best way to handle this? I've scrutinized Kimball's books on this but to no avail, and I've equally found no suitable answer on any online forums. This must be a common problem, so I'm sure there is a suitable architectural pattern to resolve it.
You will need to either compare on joined records or lookup the current dimension values.
If the amount of (unchanged) data is not excessive, you could join the full snapshots of STG_APPLICATION and STG_APPLICATION_STATUS together on APP_ID until they resemble the dimension record column-wise and store those in a separate table with their CDC hash to use as previous day. You then take the deltas at this level and send the (complete) changed records as updates to the dimension.
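A rough sketch of that first approach (the combined table name and the MD5/CONCAT functions are assumptions; use whatever hashing your platform provides):

-- Build a dimension-shaped daily snapshot with a combined hash (sketch)
INSERT INTO STG_APPLICATION_COMBINED_TODAY
    (APP_ID, APP_NAME, APP_START_DATE, STATUS_CODE, CDC_HASH)
SELECT a.APP_ID,
       a.APP_NAME,
       a.APP_START_DATE,
       s.STATUS_CODE,
       MD5(CONCAT(a.APP_NAME, '|', a.APP_START_DATE, '|', s.STATUS_CODE))
FROM STG_APPLICATION a
JOIN STG_APPLICATION_STATUS s
    ON s.APP_ID = a.APP_ID;
-- in practice wrap nullable columns in COALESCE before hashing;
-- the delta is then taken against STG_APPLICATION_COMBINED_YESTERDAY
-- exactly as for the individual staging tables.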
If the amount of records in the daily update makes it impractical to join the full tables, you can take the deltas and full outer join them as you do now. Then you look up the current dimension record for this APP_ID and fill in all empty fields in the delta record. The completed record is then sent as an update to the dimension.
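Sketched out, that lookup might look like this (assuming the joined delta from your existing process, and '99/99/9999' marking the current dimension row):

WITH joined_delta AS (
    SELECT COALESCE(a.APP_ID, s.APP_ID) AS APP_ID,
           a.APP_NAME,
           a.APP_START_DATE,
           s.STATUS_CODE
    FROM STG_APPLICATION_DELTA a
    FULL OUTER JOIN STG_APPLICATION_STATUS_DELTA s
        ON s.APP_ID = a.APP_ID
)
SELECT d.APP_ID,
       COALESCE(d.APP_NAME,       dim.APP_NAME)        AS APP_NAME,
       COALESCE(d.APP_START_DATE, dim.APP_START_DATE)  AS APP_START_DATE,
       COALESCE(d.STATUS_CODE,    dim.APP_STATUS_CODE) AS APP_STATUS_CODE
FROM joined_delta d
LEFT JOIN DIM_APPLICATION dim
    ON dim.APPLICATION_ID = d.APP_ID
   AND dim.TO_DATE = '99/99/9999';   -- current version of the dimension row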
This solution requires less storage but seems more fragile, especially if multiple changes are possible within a day. If there are many changes, performance may also suffer. For a handful of changes in millions of records, it should be more efficient.
I am participating in the creation of reporting software that uses the Kimball star schema methodology. The entire team (including me) is new to this technology.
There are a couple of dimension and fact tables in our system so far. For example:
- DIM_Customer (dimension table for customers)
- DIM_BusinessUnit (dimension table for business units)
- FT_Transaction (fact table, granularity per transaction)
- FT_Customer (fact table for customer, customer id and as on date are in composite PK)
This is current structure of FT_Customer:
- customer_id # (customer id, part of composite PK)
- as_on_date # (date of observation, part of composite PK)
- waic (KPI)
- wat (KPI)
- waddl (KPI)
- wadtp (KPI)
- aging_bucket_current (KPI)
- aging_bucket_1_to_10 (KPI)
- aging_bucket_11_to_25 (KPI)
- ... ...
Fields waic, wat, waddl and wadtp are related to delays in transaction payment. These fields are calculated by an aggregation query against the FT_Transaction table, grouped by customer_id and as_on_date.
Fields aging_bucket_current, aging_bucket_1_to_10 and aging_bucket_11_to_25 contain the number of transactions categorized by delay in payment. For example, aging_bucket_current contains the number of transactions paid on time, aging_bucket_1_to_10 the number of transactions paid with a delay of 1 to 10 days, and so on.
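For illustration, the load of the aging buckets is roughly this kind of aggregation (a sketch; the delay_days column and the bucket boundaries are assumptions, and the waic/wat/waddl/wadtp measures are omitted):

INSERT INTO FT_Customer (customer_id, as_on_date,
                         aging_bucket_current, aging_bucket_1_to_10, aging_bucket_11_to_25)
SELECT customer_id,
       as_on_date,
       SUM(CASE WHEN delay_days <= 0 THEN 1 ELSE 0 END)              AS aging_bucket_current,
       SUM(CASE WHEN delay_days BETWEEN 1 AND 10 THEN 1 ELSE 0 END)  AS aging_bucket_1_to_10,
       SUM(CASE WHEN delay_days BETWEEN 11 AND 25 THEN 1 ELSE 0 END) AS aging_bucket_11_to_25
FROM FT_Transaction
GROUP BY customer_id, as_on_date;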
This structure is used for report generation from a PHP web application as well as from Cognos studio. We have discussed restructuring the FT_Customer table to make it more usable for external systems like Cognos.
New proposed structure of FT_Customer:
- customer_id # (customer id, part of composite PK)
- as_on_date # (date of observation, part of composite PK)
- kpi_id # (id of KPI, foreign key that points to DIM_KPI dimension table, part of composite PK)
- kpi_value (value of the KPI)
- ... ...
For this proposal we will have additional dimension table DIM_KPI:
- kpi_id #
- title
This table will contain all KPIs (wat, waic, waddl, aging buckets ...).
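To populate the proposed structure from the existing wide table, a simple unpivot would do (a sketch; FT_Customer_KPI and the kpi_id values are placeholders and must match DIM_KPI):

INSERT INTO FT_Customer_KPI (customer_id, as_on_date, kpi_id, kpi_value)
SELECT customer_id, as_on_date, 1, waic  FROM FT_Customer
UNION ALL
SELECT customer_id, as_on_date, 2, wat   FROM FT_Customer
UNION ALL
SELECT customer_id, as_on_date, 3, waddl FROM FT_Customer
UNION ALL
SELECT customer_id, as_on_date, 4, wadtp FROM FT_Customer;
-- ... one SELECT per remaining KPI column (aging buckets etc.)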
The second structure of FT_Customer will obviously have more rows than the current one.
Which structure of FT_Customer is more universal?
Is it acceptable to keep both structures in separate tables? This will obviously put an additional burden on the ETL layer, because some of the work will be done twice, but on the other hand it will make it easier to generate various reports.
Thanks in advance for suggestions.
The 1st structure seems to be more natural and common to me. However, the 2nd one is more flexible, because it supports adding new KPIs without changing the structure of the fact table.
If different ways of accessing the data actually require different structures, there is nothing wrong with having two fact tables with the same data, as long as:
- both tables are always loaded together (not necessarily in parallel, but within the same data load job/workflow),
- measure calculations are consistent (reuse the logic if possible).
You should test the results for any data inconsistencies.
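For example, a simple check could compare one measure across the two tables (a sketch; FT_Customer_KPI and kpi_id = 1 standing for waic are assumptions):

SELECT w.customer_id, w.as_on_date, w.waic, n.kpi_value
FROM FT_Customer w
JOIN FT_Customer_KPI n
    ON n.customer_id = w.customer_id
   AND n.as_on_date  = w.as_on_date
   AND n.kpi_id      = 1              -- assumed id for waic in DIM_KPI
WHERE w.waic <> n.kpi_value;          -- any rows returned indicate an inconsistency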
Before you proceed, go buy yourself Agile Data Warehouse Design and read it thoroughly. It's pretty cheap.
http://www.amazon.com/Agile-Data-Warehouse-Design-Collaborative/dp/0956817203
Your fact tables are for processes or events that you want to analyze. You should name them noun_verb_noun (example customers_order_items). If you can't come up with a name like that, you probably don't have a fact table. What is your Customer Fact table for? Customer is usually a dimension table.
The purpose of your data warehouse is to facilitate analysis. Use longer column names (with _ as word separator). Make life easy on your analysts.
I'm looking for a proposal for a data structure.
I have used different tables for different parameters. For example, one table for cities, one for employees, one for departments, and so on.
Since this is not convenient, I have changed it to just a single table like:
id;
File_id;
Param_Name;
Param_val
where File_id is an integer that indicates what the parameter is for, i.e. 101 means it is a city, 102 a department, and so on.
Is this enough, or how can I go one step further (if you have a better idea)?
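For clarity, the single-table structure described above would be something like this (the table name Parameter and the column types are just examples):

CREATE TABLE Parameter (
    id         INT PRIMARY KEY,
    File_id    INT NOT NULL,           -- what the parameter is for: 101 = city, 102 = department, ...
    Param_Name VARCHAR(100) NOT NULL,
    Param_val  VARCHAR(255)
);

-- e.g. all city entries
SELECT * FROM Parameter WHERE File_id = 101;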
I am building a school management app where they track student tardiness and absences. I've got three entities to help me in this. A Students entity (first name, last name, ID, etc.); a SystemAbsenceTypes entity with SystemAbsenceTypeID values for Late, Absent-with-Reason, Absent-without-Reason; and a cross-reference table called StudentAbsences (matching the student IDs with the absence-type ID, plus a date, and a Notes field).
What I want to do is query my entities for a given student, and then add up the number of each kind of Absence, for a given date range. I prepare my currentStudent object without a problem, then I do this...
Me.Data.LoadProperty(currentStudent, "StudentAbsences") 'Loads the cross-ref data
lblDaysLate.Text = (From ab In currentStudent.StudentAbsences Where ab.SystemAbsenceTypes.SystemAbsenceTypeID = Common.enuStudentAbsenceTypes.Late).Count.ToString
...and this second line fails, complaining "Object reference not set to an instance of an object."
I presume the problem is that while it DOES see that there are (let's say) four absences for the currentStudent (ie, currentStudent.StudentAbsences.Count = 4) -- it can't yet "peer into" each one of the absences to look at its type. In fact, each of the four StudentAbsence objects has a property called SystemAbsenceType, which then finally has the SystemAbsenceTypeID.
How do I use .Expand or .LoadProperty to make this happen? Do I need to blindly loop through all these collections, firing off .LoadProperty on everything before I can do my query?
Is there some other technique?
When you load the student, try expanding the related properties.
var currentStudent = context.Students.Expand("StudentAbsences")
.Expand("StudentAbsences/SystemAbsenceTypes")
.Where(....).First();