Struggling with a data modeling problem - datamodel

I am struggling with a data model (I use MySQL for the database). I am uneasy about what I have come up with. If someone could suggest a better approach, or point me to some reference material, I would appreciate it.
The data would have organizations of many types. I am trying to do a three-level classification (Class, Category, Type). For example, an 'Italian Restaurant' would have the following classification:
Food Services > Restaurants > Italian
However, an organization may belong to multiple groups. A restaurant may serve both Chinese and Italian food, so it will fit into two classifications:
Food Services > Restaurants > Italian
Food Services > Restaurants > Chinese
The classification reference tables would be like the following:
ORG_CLASS (RowId, ClassCode, ClassName)
1, FOOD, Food Services
ORG_CATEGORY(RowId, ClassCode, CategoryCode, CategoryName)
1, FOOD, REST, Restaurants
ORG_TYPE (RowId, ClassCode, CategoryCode, TypeCode, TypeName)
100, FOOD, REST, ITAL, Italian
101, FOOD, REST, CHIN, Chinese
102, FOOD, REST, SPAN, Spanish
103, FOOD, REST, MEXI, Mexican
104, FOOD, REST, FREN, French
105, FOOD, REST, MIDL, Middle Eastern
The actual data tables would be like the following:
I will allow an organization a max of 3 classifications. I will have 3 GroupIds each pointing to a row in ORG_TYPE. So I have my ORGANIZATION_TABLE
ORGANIZATION_TABLE (OrgGroupId1, OrgGroupId2, OrgGroupId3, OrgName, OrgAddres)
100,103,NULL,MyRestaurant1, MyAddr1
100,102,NULL,MyRestaurant2, MyAddr2
100,104,105, MyRestaurant3, MyAddr3
During data entry, a dialog could let the user choose the class, category and type, and the corresponding GroupId could be populated with the RowId from the ORG_TYPE table.
During search, if all three classification levels are chosen, the search will be more specific. For example, if
Food Services > Restaurants > Italian is the criteria, the where clause would be 'where OrgGroupId1 = 100'
If only 2 levels are chosen
Food Services > Restaurants
I have to do 'where OrgGroupId1 in (100,101,102,103,104,105, .....)' - There could be a hundred in that list
I will disallow class-level search. That is, I will force selection of a class and a category.
The Ids would be integers. I am trying to anticipate performance and other issues.
Overall, would this work, or do I need to throw this out and start from scratch?

I don't like having three columns for the "up to three" classifications. In my opinion it would be better to have a cross-reference table that allows a many-to-many mapping between organisation and type, i.e. a table ORGANISATION_GROUPS with columns OrganisationId, OrgGroupId.
To make it possible to query at different levels of classification, you could set up this cross-reference table to hold the actual classification codes, i.e. ORGANISATION_GROUPS instead has columns: OrganisationId, ClassCode, CategoryCode, TypeCode.
This makes queries at different levels of classification very easy: a category-level search filters on ClassCode and CategoryCode only, with no long IN list.
For referential integrity to work with this scheme, I'd then suggest not using surrogate integer keys for your ORG_* tables but instead making the real unique key the primary key, i.e. (ClassCode, CategoryCode, TypeCode) for ORG_TYPE.
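A minimal sketch of the cross-reference idea above, using Python's built-in sqlite3 module rather than MySQL so it can be run as-is. The ORGANISATION table, its columns and the helper names build_db and find_orgs are assumptions made for illustration; the two sample organisations reuse the classifications from the question.

```python
import sqlite3


def build_db():
    """Create the natural-key tables in an in-memory database (illustrative only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE ORG_TYPE (
            ClassCode    TEXT NOT NULL,
            CategoryCode TEXT NOT NULL,
            TypeCode     TEXT NOT NULL,
            TypeName     TEXT NOT NULL,
            PRIMARY KEY (ClassCode, CategoryCode, TypeCode)
        );
        -- Hypothetical organisation table; only the key and name are modelled.
        CREATE TABLE ORGANISATION (
            OrganisationId INTEGER PRIMARY KEY,
            OrgName        TEXT NOT NULL
        );
        CREATE TABLE ORGANISATION_GROUPS (
            OrganisationId INTEGER NOT NULL
                REFERENCES ORGANISATION (OrganisationId),
            ClassCode      TEXT NOT NULL,
            CategoryCode   TEXT NOT NULL,
            TypeCode       TEXT NOT NULL,
            PRIMARY KEY (OrganisationId, ClassCode, CategoryCode, TypeCode),
            FOREIGN KEY (ClassCode, CategoryCode, TypeCode)
                REFERENCES ORG_TYPE (ClassCode, CategoryCode, TypeCode)
        );
    """)
    conn.executemany("INSERT INTO ORG_TYPE VALUES (?, ?, ?, ?)", [
        ("FOOD", "REST", "ITAL", "Italian"),
        ("FOOD", "REST", "SPAN", "Spanish"),
        ("FOOD", "REST", "MEXI", "Mexican"),
    ])
    conn.executemany("INSERT INTO ORGANISATION VALUES (?, ?)", [
        (1, "MyRestaurant1"),
        (2, "MyRestaurant2"),
    ])
    # One row per (organisation, classification); no three-column limit.
    conn.executemany("INSERT INTO ORGANISATION_GROUPS VALUES (?, ?, ?, ?)", [
        (1, "FOOD", "REST", "ITAL"),
        (1, "FOOD", "REST", "MEXI"),
        (2, "FOOD", "REST", "ITAL"),
        (2, "FOOD", "REST", "SPAN"),
    ])
    return conn


def find_orgs(conn, class_code, category_code, type_code=None):
    """Search at category level, or at type level when type_code is given."""
    sql = """
        SELECT DISTINCT o.OrgName
        FROM ORGANISATION o
        JOIN ORGANISATION_GROUPS g ON g.OrganisationId = o.OrganisationId
        WHERE g.ClassCode = ? AND g.CategoryCode = ?
    """
    params = [class_code, category_code]
    if type_code is not None:
        sql += " AND g.TypeCode = ?"
        params.append(type_code)
    sql += " ORDER BY o.OrgName"
    return [row[0] for row in conn.execute(sql, params)]
```

With this layout a "Food Services > Restaurants" search is the same query as a type-level search minus one predicate, e.g. find_orgs(conn, "FOOD", "REST") versus find_orgs(conn, "FOOD", "REST", "MEXI").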

The problem I see in your design is that it is a bit rigid. A more flexible approach you might want to consider is the following:
First, you would have a single table for classes, categories, types and any other classification level. This table would be self-referencing: every row would have a field referring to its immediate parent, like the following:
CLASSIFICATION (Id, Description, Parent_Id)
ITAL, Italian, REST
CHIN, Chinese, REST
MEXI, Mexican, REST
REST, Restaurant, FOOD
Next you would have, as @John Pickup suggested, an intermediate cross-reference table between your restaurant (or whatever you need) table and the classification table. It would contain only a composite primary key made up of the primary keys of both tables.
FOODSERVICE_CLASSIFICATION (Rest_Id, Class_Id)
100, ITAL
100, CHIN
101, MEXI
102, CHIN
It would be advisable to restrict it so that only leaf rows of the CLASSIFICATION table can be referenced in the cross-reference table.
Your example of looking for all restaurants would be as simple as finding all child categories of REST and searching for them in the cross-reference table. This can be written in a single select in Oracle (not sure about other RDBMS).
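One way to write that lookup as a single statement is a recursive common table expression. The sketch below runs it on SQLite through Python's sqlite3 module; to my knowledge MySQL 8.0 and later also accept WITH RECURSIVE, but treat that as something to verify for your version. The FOOD root row and the function names are assumptions added for the example.

```python
import sqlite3


def build_db():
    """Self-referencing classification table plus the cross-reference table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE CLASSIFICATION (
            Id          TEXT PRIMARY KEY,
            Description TEXT NOT NULL,
            Parent_Id   TEXT REFERENCES CLASSIFICATION (Id)
        );
        CREATE TABLE FOODSERVICE_CLASSIFICATION (
            Rest_Id  INTEGER NOT NULL,
            Class_Id TEXT NOT NULL REFERENCES CLASSIFICATION (Id),
            PRIMARY KEY (Rest_Id, Class_Id)
        );
    """)
    conn.executemany("INSERT INTO CLASSIFICATION VALUES (?, ?, ?)", [
        ("FOOD", "Food Services", None),   # assumed root row
        ("REST", "Restaurant", "FOOD"),
        ("ITAL", "Italian", "REST"),
        ("CHIN", "Chinese", "REST"),
        ("MEXI", "Mexican", "REST"),
    ])
    conn.executemany("INSERT INTO FOODSERVICE_CLASSIFICATION VALUES (?, ?)", [
        (100, "ITAL"),
        (100, "CHIN"),
        (101, "MEXI"),
        (102, "CHIN"),
    ])
    return conn


def restaurants_under(conn, root_id):
    """Return the Rest_Ids classified anywhere in the subtree rooted at root_id."""
    sql = """
        WITH RECURSIVE subtree (Id) AS (
            SELECT Id FROM CLASSIFICATION WHERE Id = ?
            UNION ALL
            SELECT c.Id
            FROM CLASSIFICATION c
            JOIN subtree s ON c.Parent_Id = s.Id
        )
        SELECT DISTINCT fc.Rest_Id
        FROM FOODSERVICE_CLASSIFICATION fc
        JOIN subtree s ON s.Id = fc.Class_Id
        ORDER BY fc.Rest_Id
    """
    return [row[0] for row in conn.execute(sql, (root_id,))]
```

Calling restaurants_under(conn, "REST") walks down from REST to its leaves and then matches them against the cross-reference rows, so the caller never has to build the list of leaf codes by hand.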
This way you can:
Have multiple categorizations for your restaurants without being limited to 3 categories.
Do quick searches using the cross-reference table.
Mind you, this schema works only if your categorization is a tree with a base category acting as the root. If you need a looser categorization, you would probably need a tags approach.
Btw, I also agree with @John Pickup that it is better to use real primary keys in this case.
HTH

Related

DM and hierarchies - dimensions for future use

My very first DM so be gentle..
Modeling a hierarchy with ERD as follows:
Responses are my facts. All the advice I've seen indicates creating a single dimension (say dim_event) and denormalizing event, department and organization into that dimension:
What if I KNOW that there will be future facts/reports that rely on an Organization dimension, or a Department dimension that do not involve this particular fact?
It makes more sense to me (from the OLTP world) to create individual dimensions for the major components and attach them to the fact. That way they could be reused as conformed dimensions.
This way, when a dimension attribute needs updating, there would be only one dim table to change; if I had everything denormalized, the org name could appear in several dimension tables.
--Update--
As requested:
An "event" is an email campaign designed to gather response data from a specific subset of clients. They log in and we ask them a series of questions and score the answers.
The "response" is the set of scores we generate from the event.
So an "event" record may look like this:
name: '2019 test Event'
department: 'finance'
"response" records look something like this:
event: '2019 test Event'
retScore: 2190
balScore: 19.98
If your organization and department are tightly coupled (i.e. department implies organization as well), they should be denormalized and created as a single dimension. If department & organization do not have a hierarchical relationship, they would be separate dimensions.
Your Event would likely be a dim (degenerate) and a fact. The fact would point to the various dimensions that describe the Event and would contain the measures about what happened at the Event (retScore, balScore).
A good way to identify if you're dealing with a dim or a fact is to ask "What do I know before anything happens?" I expect you'd know which orgs & depts are available. You may even know certain types of recurring events (blood drive, annual fundraiser), which could also be a separate dimension (event type). But you wouldn't have any details about a specific event, HR Fundraiser 2019 (fact), until one is scheduled.
A dimension represents the possibilities, but a fact record indicates that something actually happened. My favorite analogy for this is a restaurant menu vs a restaurant order. The items on the menu can be referenced even if they've never been ordered. The menu is the dimension, the order is the fact.
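The menu/order analogy can be sketched as two tables. This is only an illustration using SQLite from Python; the table names dim_menu_item and fact_order, their columns and the sample dishes are all made up for the example.

```python
import sqlite3


def build_db():
    """A tiny dimension (the menu) and fact table (the orders)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Dimension: everything that could be ordered.
        CREATE TABLE dim_menu_item (
            menu_item_id INTEGER PRIMARY KEY,
            item_name    TEXT NOT NULL
        );
        -- Fact: one row per thing that actually was ordered.
        CREATE TABLE fact_order (
            order_id     INTEGER PRIMARY KEY,
            menu_item_id INTEGER NOT NULL
                REFERENCES dim_menu_item (menu_item_id),
            quantity     INTEGER NOT NULL
        );
    """)
    conn.executemany("INSERT INTO dim_menu_item VALUES (?, ?)", [
        (1, "Lasagne"),
        (2, "Tiramisu"),
        (3, "Minestrone"),   # on the menu, never ordered
    ])
    conn.executemany("INSERT INTO fact_order VALUES (?, ?, ?)", [
        (1, 1, 2),
        (2, 1, 1),
        (3, 2, 1),
    ])
    return conn


def order_totals(conn):
    """Total quantity per menu item, including items with no orders."""
    sql = """
        SELECT d.item_name, COALESCE(SUM(f.quantity), 0) AS total_quantity
        FROM dim_menu_item d
        LEFT JOIN fact_order f ON f.menu_item_id = d.menu_item_id
        GROUP BY d.menu_item_id, d.item_name
        ORDER BY d.item_name
    """
    return conn.execute(sql).fetchall()
```

The LEFT JOIN from the dimension is what lets a never-ordered item show up with a total of zero, which is the "possibilities versus what happened" distinction in query form.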
Hope this helps.

Is Elasticsearch X-Pack able to return graph vertices across different types?

I have product type data loaded into Elasticsearch containing catalogue_number and name. I also have customer data loaded into Elasticsearch containing name and purchases (where purchases is an array of product numbers).
For example:
CATALOGUE_NUMBER, NAME
518, "Toilet Paper"
388, "Candy Bar"
263, "Carrots"
And, for customers:
NAME, PURCHASES
"Jack", [518, 388]
"John", [263]
"Bill", [263, 518]
Considering the relationship is many to one (i.e. customers purchase many items), am I able to use Kibana to view a graph linking purchases to specific customers, or is this out of scope?
My end goal is to have a graph showing product and customer as vertices and edges showing which products each customer purchases. I am very confused as to whether Elasticsearch is capable, or if I should move to a pure graph database such as Neo4J and Elasticsearch for searching only.
The Graph feature can draw out these connections if they share a common field name - the unique identity of a node is a field name and a term. Terms can be in different indices but as long as they share a common field name they are seen as the same node.
I'm not sure which business problem you are trying to solve (recommendations? Fraud?) but depending on what you are trying to achieve you may want to model things differently.
If you're interested in recommendations and people who-bought-X-also-bought-Y style suggestions then the people are unlikely to be interesting nodes to plot and you can just examine the "purchases" field which will draw out which products significantly co-occur.
For more detailed "forensic" type applications you may want to just have person->product links and not have product->product links in which case you would be forced to create more classical "edge-like" documents with only 2 nodes - a person ID and a product ID.
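As a rough illustration of those two-node "edge-like" documents, the customer records from the question can be flattened into one document per purchase before indexing. This is plain Python with no Elasticsearch client calls; the field names person_id and product_id are hypothetical, not something the Graph feature requires.

```python
def to_edge_documents(customers):
    """Flatten customer records into one {person, product} document per purchase."""
    edges = []
    for customer in customers:
        for product_number in customer["purchases"]:
            edges.append({
                "person_id": customer["name"],   # hypothetical field name
                "product_id": product_number,    # hypothetical field name
            })
    return edges


# Sample data taken from the question.
CUSTOMERS = [
    {"name": "Jack", "purchases": [518, 388]},
    {"name": "John", "purchases": [263]},
    {"name": "Bill", "purchases": [263, 518]},
]
```

Each resulting document links exactly one person to one product, so product-to-product co-occurrence within a single document is no longer possible.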

How to model an OLTP audit table in dimensional schema?

We have an audit table that we get from an OLTP system. It records any activity performed by a user, such as downloading an attachment, reading or writing a note, or making any change to an incident. How do we include this audit table activity in our dimensional model for an incident management system (IT service management)?
On a simple level, which is all I can offer given the level of detail in the question, look at your audit table and decide which categories of audit you want to be dimensions. Perhaps there are audit_type, user_type, and audit_subtype fields or something like that? Also, you typically have another field called a "measure" or "quantity", which holds a numeric value and supports aggregate functions. For example, you might have store_id and product_cat as categorical dimensions, but roll up sales$ as min, max, avg, stdev grouped by different date levels such as month and quarter, and by other dimensions. If your data is purely categorical by date, then COUNT() is usually used as a calculated measure.
You really just need to decide how you want to be able to drill up and drill down through the data, which categories matter, and which quantities matter. Once you decide that, create a flat table with FKs to lookup tables. A star schema is simply a fact table with a bunch of lookup tables floating around it like a star.
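A small sketch of that shape for audit data, assuming a single audit_type dimension and COUNT() as the only measure. It uses SQLite from Python; the table names, column names and sample rows are invented for the example, and a real model would likely add user and date dimensions.

```python
import sqlite3


def build_audit_star():
    """One fact table with a foreign key to one lookup (dimension) table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_audit_type (
            audit_type_id   INTEGER PRIMARY KEY,
            audit_type_name TEXT NOT NULL
        );
        CREATE TABLE fact_audit (
            audit_id      INTEGER PRIMARY KEY,
            audit_type_id INTEGER NOT NULL
                REFERENCES dim_audit_type (audit_type_id),
            audit_date    TEXT NOT NULL   -- ISO date, e.g. 2019-01-05
        );
    """)
    conn.executemany("INSERT INTO dim_audit_type VALUES (?, ?)", [
        (1, "download attachment"),
        (2, "read note"),
        (3, "write note"),
    ])
    conn.executemany("INSERT INTO fact_audit VALUES (?, ?, ?)", [
        (1, 1, "2019-01-05"),
        (2, 2, "2019-01-05"),
        (3, 2, "2019-02-10"),
    ])
    return conn


def activity_by_month(conn):
    """Count audit events per month and audit type (COUNT as the measure)."""
    sql = """
        SELECT substr(f.audit_date, 1, 7) AS month,
               d.audit_type_name,
               COUNT(*) AS activity_count
        FROM fact_audit f
        JOIN dim_audit_type d ON d.audit_type_id = f.audit_type_id
        GROUP BY month, d.audit_type_name
        ORDER BY month, d.audit_type_name
    """
    return conn.execute(sql).fetchall()
```

Drilling up or down then amounts to changing the GROUP BY columns, for example dropping the month to get totals per audit type.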
Hope this helps

Modeling many-to-many relationship in data warehouse

I have to design a data warehouse model and an ETL process for a class at my university. My data warehouse has to store opinions / comments about a product. Each record should consist of:
comment text (String)
product score ({0, 0.5, … , 4.5, 5})
comment author (String)
comment date (Date)
product recommendation ({Yes, No})
comment up votes (Int)
comment down votes (Int)
product pros (many Strings, e.g {price, design, durability, … }) and its count
product cons (many Strings, e.g {too loud, too heavy, price, … }) and its count
In addition data warehouse should store information about product:
product category
product brand
product model
I want to create the data warehouse model first, but I have a problem with storing product pros and cons, as it is a many-to-many relationship. In a normal relational database I would simply create an associative table, but here I am not sure how to proceed; after all, I don’t want to normalize the fact table.
I am considering 3 approaches. The first is the one presented in the diagram below: I used the bridge table method (though I don’t know if I did so correctly) to get rid of the many-to-many relationship. I don’t know how it will impact query performance.
The second approach I may use is the boolean column method. In the PROS and CONS tables I can create a column for each possible value, but there can be up to 100 different pros or cons. Also, the number of possible pros or cons is not constant over time. Authors can list new pros or cons in their comments (that’s how it works in the data source), but I can’t add new columns (I shouldn’t change data in the data warehouse).
The third approach I am considering is to keep pros in the PROS table but in 1 column, where values are separated by commas or some other delimiter, e.g. “price, design, color”. It keeps things simple but makes the data hard to analyze or slice & dice.
Which approach should I use in this situation? Which is better for loading data into the data warehouse, given that I will get all the comments from the data source and want to load only the comments that are new since the last load?
I think a slightly modified version of your first option would be the best, as far as I understand it.
In the image you provided, the Pros_Bridge_Detail table is fine. The rest needs to change.
You can remove the pros_Bridge table that holds just the count and add that column to your COMMENT fact table instead. That is more efficient and easier to query than going through several tables.
You said there are many kinds of pros, such as price, design, durability etc. Let's put those into a separate dimension.
Add a new column to your Pros_Bridge_Detail table to hold the ID of the newly created dimension that holds the product pro types (design, durability etc.).
Now, once you add a product pro, the Pros_Bridge_Detail table will hold the pros the user gave, and the ID of the new dimension identifies what each pro is.
Also don't forget to store the Comment ID in the Pros_Bridge_Detail table, as that will be your link (FK) to the COMMENT fact table.
The same can be done for cons.
I hope this explanation makes sense and helps. Let me know if you have any issues.
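A minimal sketch of that layout, assuming SQLite via Python for runnability. The names fact_comment, dim_pro_type and pros_bridge_detail, and the helper add_pro, are hypothetical stand-ins for the COMMENT fact table, the new pro-type dimension and Pros_Bridge_Detail. It also shows that a pro nobody has listed before becomes a new dimension row, not a new column.

```python
import sqlite3


def build_db():
    """Comment fact table, pro-type dimension, and the bridge between them."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE fact_comment (
            comment_id INTEGER PRIMARY KEY,
            author     TEXT NOT NULL,
            score      REAL NOT NULL,
            pros_count INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE dim_pro_type (
            pro_type_id INTEGER PRIMARY KEY,
            pro_name    TEXT NOT NULL UNIQUE
        );
        CREATE TABLE pros_bridge_detail (
            comment_id  INTEGER NOT NULL REFERENCES fact_comment (comment_id),
            pro_type_id INTEGER NOT NULL REFERENCES dim_pro_type (pro_type_id),
            PRIMARY KEY (comment_id, pro_type_id)
        );
    """)
    return conn


def add_pro(conn, comment_id, pro_name):
    """Attach a pro to a comment, creating the dimension row if it is new."""
    conn.execute(
        "INSERT OR IGNORE INTO dim_pro_type (pro_name) VALUES (?)", (pro_name,)
    )
    (pro_type_id,) = conn.execute(
        "SELECT pro_type_id FROM dim_pro_type WHERE pro_name = ?", (pro_name,)
    ).fetchone()
    conn.execute(
        "INSERT INTO pros_bridge_detail (comment_id, pro_type_id) VALUES (?, ?)",
        (comment_id, pro_type_id),
    )
    # Keep the count on the fact row, as suggested above.
    conn.execute(
        "UPDATE fact_comment SET pros_count = pros_count + 1 WHERE comment_id = ?",
        (comment_id,),
    )


def pro_frequencies(conn):
    """Number of comments mentioning each pro."""
    sql = """
        SELECT d.pro_name, COUNT(*) AS comment_count
        FROM pros_bridge_detail b
        JOIN dim_pro_type d ON d.pro_type_id = b.pro_type_id
        GROUP BY d.pro_name
        ORDER BY d.pro_name
    """
    return conn.execute(sql).fetchall()
```

A cons dimension and bridge would mirror these tables one for one.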

Database schema design for large number of columns

I have a use case where I need to model reference data for e.g. different flavors of ice cream. Say I have 50 flavors of ice cream :-
20 attributes e.g. freezing-temp, creaminess will be shared across all flavors
every flavor of ice cream would have 20-30 attributes that will not be shared with other flavors e.g. :-
Strawberry ice cream might track tartness, fruit percentage etc.
Chocolate ice cream might track bitterness, cocoa level etc.
How would I model this data neatly in a database model, purely from a storage / retrieval point of view?
The options I can think of :-
One table per flavor. This will need 50 tables, and each table will have 20 columns that will overlap with each other, and another 20-30 attributes that will be unique to the flavor.
Pros : models the data of each flavor quite well
Cons : column overlap and large number of tables needed
One table for all flavors. This will only need 1 table, but will require 1000+ columns most of which would be empty.
Pros : models the data of ice cream in general, quite well
Cons : large number of columns and large amount of 'wasted' space
One key-value table for all flavors, with flavor Id, attribute name and attribute value.
Pros : simplest to create and insert data
Cons : harder to extract, not really a data model per se, difficult to form constraints for attributes, or for attributes related to other attributes
Never store a value in the wrong type.
Whatever design you choose, make sure that values are stored in their natural format. Use NUMBER, DATE, VARCHAR2, CLOB, XMLTYPE, CLOB (IS JSON), TIMESTAMP, etc. Trying to cram everything in a string will cause many problems. You lose validation, convenience, performance, and type safety.
For example, here is a common type safety problem. Imagine this simple query to find ice cream that is more than 25% fruit:
select *
from ice_cream_flavor_attribute
where attribute_name = 'Fruit Percentage'
and attribute_value > 25;
Do you see the bug? Do you see how the same query, with the same data, may work one day and fail the next with ORA-01722: invalid number?
It's difficult to write a query that forces Oracle to evaluate conditions in a specific order. Re-ordering the predicates won't help (99.9% of the time). Adding an inline view won't help (99.9% of the time). Using a CASE statement will work but not 100% of the time. Using hints will work but is tricky. Using an inline view and a ROWNUM is my preferred way of solving the problem but it looks odd and is difficult to understand.
If you must use an Entity Attribute Value model (and if you have more than 1000 attributes it may be unavoidable), at least use the right types.
Don't worry about space - a null column uses at most 1 byte.
Don't worry about complaints like "but then our queries are more complicated, we always need to know which column to use!" - realistically there is almost nothing useful you can do with a value without knowing its type. Every time you read or write a value you must already be thinking about the type.
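To make "use the right types" concrete, here is one possible shape for a typed Entity Attribute Value table: a separate value column per type, with a check that exactly one is filled in. The sketch uses SQLite from Python, so it will not reproduce ORA-01722 and its column types are looser than Oracle's; the column names number_value, string_value and date_value and the sample rows are assumptions for illustration.

```python
import sqlite3


def build_typed_eav():
    """EAV table with one value column per type instead of a single string."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE ice_cream_flavor_attribute (
            flavor         TEXT NOT NULL,
            attribute_name TEXT NOT NULL,
            number_value   REAL,
            string_value   TEXT,
            date_value     TEXT,  -- ISO-8601 text; SQLite has no DATE type
            PRIMARY KEY (flavor, attribute_name),
            -- Exactly one of the value columns must be set.
            CHECK ((number_value IS NOT NULL)
                 + (string_value IS NOT NULL)
                 + (date_value   IS NOT NULL) = 1),
            -- SQLite is dynamically typed, so reject non-numeric text here.
            CHECK (number_value IS NULL
                   OR typeof(number_value) IN ('integer', 'real'))
        );
    """)
    conn.executemany(
        "INSERT INTO ice_cream_flavor_attribute "
        "(flavor, attribute_name, number_value, string_value) "
        "VALUES (?, ?, ?, ?)",
        [
            ("Strawberry", "Fruit Percentage", 30, None),
            ("Raspberry", "Fruit Percentage", 20, None),
            ("Chocolate", "Cocoa Origin", None, "Ghana"),
        ],
    )
    return conn


def flavors_with_fruit_above(conn, threshold):
    """Numeric comparison against a numeric column; no string conversion."""
    sql = """
        SELECT flavor
        FROM ice_cream_flavor_attribute
        WHERE attribute_name = 'Fruit Percentage'
          AND number_value > ?
        ORDER BY flavor
    """
    return [row[0] for row in conn.execute(sql, (threshold,))]
```

Because the comparison runs against number_value, a string attribute such as the made-up 'Cocoa Origin' row can never be fed to the numeric predicate, whatever order the conditions are evaluated in.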
I'd have one table with all the common attributes, then another for the non-shared attributes. For example:
CREATE TABLE ICE_CREAM_FLAVOR
(FLAVOR VARCHAR2(100) PRIMARY KEY,
FREEZING_TEMP NUMBER,
CREAMINESS NUMBER,
ETC VARCHAR2(25),
BLAH NUMBER);
CREATE TABLE ICE_CREAM_FLAVOR_ATTRIBUTE
(ID_ICF_ATTRIBUTE NUMBER, -- should be populated by an insert trigger
FLAVOR VARCHAR2(100)
NOT NULL
REFERENCES ICE_CREAM_FLAVOR(FLAVOR),
ATTRIBUTE_NAME VARCHAR2(25),
ATTRIBUTE_VALUE VARCHAR2(100));
Your mileage may vary.
Share and enjoy.
I would suggest creating 3 different tables.
Ice Cream Flavor: stores all the flavors of ice cream. This is the icecream_flavor_master table. If you have 50 flavors, it will have 50 rows, such as Strawberry, Chocolate etc.
Ice Cream Attributes: stores all the attributes of ice cream. This is the icecream_attribute_master table. If you have 50 attributes, it will have 50 rows, such as tartness, bitterness, fruit percentage, cocoa level etc.
Ice Cream Flavor Attributes: stores the primary keys of icecream_flavor_master and icecream_attribute_master, to relate each flavor to its attributes.
Let me know for further information.
You might be able to group flavors into classes of flavors, ones that share certain attributes. This lends itself to classes and subclasses that extend other classes.
If you want to do ER modeling on this, look up "generalization/specialization" on the web. Some websites will call this a feature of "Extended ER modeling" or EER.
If you want to design relational tables to implement the ER design, look into two patterns: Single Table Inheritance and Class Table Inheritance.
https://stackoverflow.com/tags/single-table-inheritance/info
https://stackoverflow.com/tags/class-table-inheritance/info
Also, look into Martin Fowler's treatment on this subject on the web, or in one of his textbooks.
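For orientation, a Class Table Inheritance layout might look roughly like the following: one table for the shared attributes and one table per class of flavor, each keyed by the same flavor value. This is a sketch on SQLite via Python; the classes fruit_flavor and cocoa_flavor and all sample values are invented for the example.

```python
import sqlite3


def build_inheritance_tables():
    """Shared attributes in a base table, class-specific ones in subtables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Base table: attributes every flavor has.
        CREATE TABLE ice_cream_flavor (
            flavor        TEXT PRIMARY KEY,
            freezing_temp REAL,
            creaminess    INTEGER
        );
        -- Subclass tables share the base table's key.
        CREATE TABLE fruit_flavor (
            flavor           TEXT PRIMARY KEY
                REFERENCES ice_cream_flavor (flavor),
            tartness         INTEGER,
            fruit_percentage REAL
        );
        CREATE TABLE cocoa_flavor (
            flavor      TEXT PRIMARY KEY
                REFERENCES ice_cream_flavor (flavor),
            bitterness  INTEGER,
            cocoa_level REAL
        );
    """)
    conn.executemany("INSERT INTO ice_cream_flavor VALUES (?, ?, ?)", [
        ("Strawberry", -12.0, 7),
        ("Chocolate", -11.0, 8),
    ])
    conn.execute("INSERT INTO fruit_flavor VALUES (?, ?, ?)", ("Strawberry", 6, 30.0))
    conn.execute("INSERT INTO cocoa_flavor VALUES (?, ?, ?)", ("Chocolate", 5, 70.0))
    return conn


def fruit_flavors(conn):
    """Join the base table to one subclass table to get the full row."""
    sql = """
        SELECT f.flavor, f.creaminess, s.fruit_percentage
        FROM ice_cream_flavor f
        JOIN fruit_flavor s ON s.flavor = f.flavor
        ORDER BY f.flavor
    """
    return conn.execute(sql).fetchall()
```

Single Table Inheritance would instead fold the subclass columns into ice_cream_flavor and leave them NULL where they do not apply, which is essentially the "one table for all flavors" option from the question.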
Here is what big vendors do for huge data volumes in ECM (enterprise content management), which is a quite similar scenario (many custom classes with custom attributes of various types, some of which may be shared):
One key-value table for all flavors, with flavor Id, attribute name and attribute value.
They use one key-value table per type (string, number, date etc.).
For performance optimization, they allow dedicated tables to be defined for individual attributes, in order to keep each index small and not crowded with other attributes.
Dedicated tables make sense for:
Massive usage (having many rows)
Bad histograms (like flags)
Otherwise the Oracle index could be tricked and a full table scan would become the fastest access path, which would be really bad.
So think early about performance when you have a huge amount of data.

Resources