Database design: Low-overhead solution for managing daily inventories / capacities?

Here is the scenario: (MySQL 5.1+, PHP, Apache)
I am planning a SaaS application that will let CLIENTS visit SHOPS and book TRIPS (ALL CAPS are entities). SHOPS offer TRIPS, but they only have a certain number of EMPLOYEES to guide those TRIPS (a TRIP is a transactional record). Essentially, it is an issue of managing a daily capacity for each SHOP based on the number of available EMPLOYEES. What is the best DB design for delivering this functionality with the lowest amount of overhead?
Here is a simplified view of the database entities:
table.clients
    client_id (pk, ai)
table.shops
    shop_id (pk, ai)
table.employees
    employee_id (pk, ai)
    shop_id (fk)
table.trips
    trip_id (pk, ai)
    client_id (fk)
    shop_id (fk)
    trip_date (date)
SCENARIO 1
I could run a query on TRIPS for every request when a user wants to view the calendar, like:
SELECT COUNT(*),
       trips.trip_date,
       trips.shop_id
FROM trips
WHERE shop_id = 1
GROUP BY trips.trip_date, trips.shop_id
SCENARIO 2
I could create a summary table that stores info for every day, but this strategy seems nightmarish in terms of overhead. For instance, imagine that there are 1,000 shops, each booking 1,000 trips per 365-day year, and that the table should store info for the next 2 years (730 days). It seems like that would (1) create a huge summary table (730,000 rows) that would (2) be hit with 1,000,000+ booking updates per year (1,000 shops * 1,000 trips per shop). When a CLIENT booked a TRIP, the day's number would be incremented (or decremented when a trip was cancelled), which would effectively maintain a daily inventory/capacity.
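(For concreteness, a minimal sketch of what that summary table and its upkeep could look like in MySQL - all names below are illustrative, not part of the schema above:)

-- Hypothetical summary table: one row per shop per day
CREATE TABLE shop_capacity (
    shop_id   INT NOT NULL,
    trip_date DATE NOT NULL,
    booked    INT NOT NULL DEFAAULT 0,
    PRIMARY KEY (shop_id, trip_date)
);

-- A booking upserts the day's counter...
INSERT INTO shop_capacity (shop_id, trip_date, booked)
VALUES (1, '2013-06-01', 1)
ON DUPLICATE KEY UPDATE booked = booked + 1;

-- ...and a cancellation decrements it
UPDATE shop_capacity
SET booked = booked - 1
WHERE shop_id = 1 AND trip_date = '2013-06-01';

Note that with an upsert, rows only need to exist for days that actually have bookings, which would shrink the table well below the worst-case estimate.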
So, my question is this: Which method is the best? Or is there a better way to accomplish this?
Thanks!

Sounds like fun!
Firstly - I know you've given us a simplified version of the schema, so I assume there's a lot more elsewhere, but your "trips" table looks wrong - if shops have one and only one client, you don't need the client ID in the trips table.
However, you do need a "booked_trips" table to record which trip is booked to which employee. You could store that against the "trips" table too, but typically a booking has lots of other stuff, like an invoice, a booked date, etc., so you may want to separate those things out.
I'd recommend something like your Scenario 1: use queries to derive the data stored in normalized tables, rather than Scenario 2, which is effectively a denormalization for speed.
It's worth defining "overhead" in your question - pretty much all of these design questions trade space versus speed; if by overhead you mean disk space, you get a different answer than if you mean "time to run my queries".
Generally, my advice is to work with a normalized approach and measure performance; only denormalize if you know you have a problem.
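As a hedged sketch of that normalized approach: remaining daily capacity can be derived on the fly from the posted tables, assuming each EMPLOYEE guides at most one TRIP per day (the aliases and remaining_capacity are invented names):

SELECT t.trip_date,
       t.shop_id,
       e.staff_count - COUNT(*) AS remaining_capacity
FROM trips t
INNER JOIN (SELECT shop_id, COUNT(*) AS staff_count
            FROM employees
            GROUP BY shop_id) e ON e.shop_id = t.shop_id
WHERE t.shop_id = 1
GROUP BY t.trip_date, t.shop_id, e.staff_count;

With an index on trips (shop_id, trip_date), this may well stay cheap enough that the summary table is never needed.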

Related

USERELATIONSHIP in a measure with text column

CONTEXT:
I want to monitor payment transactions for money laundering, where payments cross multiple borders. A max of 6 countries is shown per transaction. For each of these countries, I need to know its risk level.
I have 2 tables:
Transaction data (where there are many transactions from same country)
Country Risk (containing each country once, with an added risk classification. There are 100+ countries, and only 6 different Risk levels).
Problem:
I would like to look up the Risk Classification per country in Transaction Data. The problem is, there are 6 countries per transaction in Transaction Data, so I have to link Transaction Data to Country Risk 6 times. Only 1 relationship can be active, of course.
So I tried writing the following measure:
CALCULATE(
    VALUES('Country Risk'[Risk classification]);
    USERELATIONSHIP('Transaction Data'[Country 2]; 'Country Risk'[Country Code])
)
I get an error, though, when using the measure in a graph where I listed the countries from Transaction Data (where every country appears in multiple rows) and wanted to see the related risk categories:
A table of multiple values was supplied where a single value was expected
What am I doing wrong?
Made similar test data: https://drive.google.com/file/d/1_kJW-BpbrwCsbSpxdo7AJ3IzPy2oLWFJ/view?usp=sharing
Needed:
For every C column (C1-C6) I need to add the risk category.
For every C column I need to make a pie chart showing the number of transactions per risk category for that column.
The pie charts should filter the transaction overview.
I've talked to a Power BI consultant about this; there is no way to solve this the way I want (multiple relationships between 2 tables all acting as if they were active at the same time).
The only ways of getting it done would be:
1. Write measures (but that doesn't allow cross-filtering between the pie chart and the table below).
2. Unpivot the country columns, as sketched after this list (but that wouldn't allow having 6 risk-category columns in the table).
3. Have 6 dimension tables (this solves the issue, because it allows cross-filtering between the pie chart, the other pie charts, and the table, and it allows 6 separate risk columns in the table visual).
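(For what it's worth, the unpivot in option 2 would normally be done in Power Query, but the reshape is easiest to show in SQL terms - TransactionData, TransactionID, and Country1..Country6 are invented stand-ins for the real names:)

-- One row per (transaction, country slot); the result needs only a
-- single relationship to Country Risk instead of six
SELECT TransactionID, 'C1' AS CountrySlot, Country1 AS CountryCode FROM TransactionData
UNION ALL
SELECT TransactionID, 'C2', Country2 FROM TransactionData
UNION ALL
SELECT TransactionID, 'C3', Country3 FROM TransactionData
UNION ALL
SELECT TransactionID, 'C4', Country4 FROM TransactionData
UNION ALL
SELECT TransactionID, 'C5', Country5 FROM TransactionData
UNION ALL
SELECT TransactionID, 'C6', Country6 FROM TransactionData;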
thanks for trying to help guys!

How to model an OLTP audit table in dimensional schema?

We have an audit table which we get from the OLTP system. It records any activity done by the user, including whether he downloaded an attachment, read or wrote a note, or made any change to an incident, etc. How do we include this audit table activity in our dimensional model for an incident management system (IT service management)?
On a simple level (which is all I can provide based on the level of detail in the question), look at your audit table and decide which categories of audit you want to be dimensions. Perhaps there are audit_type, user_type, and audit_subtype fields, or something like that? Also, you typically have another field called a "measure" or "quantity", used for stats on numerics, to support aggregate functions. For example, you might have store_id and product_cat as categorical dimensions, but roll up sales $ as min, max, avg, stdev, grouped by different date types like month and quarter and by other dimensions. If your data is purely categorical by date, then COUNT() is usually used as a calculated measure.
You really just need to decide how you want to be able to drill up and drill down through the data: which categories matter, and which quantities matter. Once you decide that, create a flat table with FKs to lookup tables. A star schema is simply a fact table with a bunch of lookup tables floating around it like a star.
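As a hedged illustration of that shape - every table and column name below is invented for the example, since the question doesn't give the actual audit fields:

-- Lookup (dimension) tables
CREATE TABLE dim_audit_type (
    audit_type_id INT PRIMARY KEY,
    audit_type    VARCHAR(50),
    audit_subtype VARCHAR(50)
);
CREATE TABLE dim_user (
    user_id   INT PRIMARY KEY,
    user_type VARCHAR(50)
);
CREATE TABLE dim_date (
    date_id    INT PRIMARY KEY,
    full_date  DATE,
    month_no   INT,
    quarter_no INT
);

-- Flat fact table: one row per audit event, COUNT(*) as the measure
CREATE TABLE fact_audit_event (
    audit_type_id INT,  -- FK to dim_audit_type
    user_id       INT,  -- FK to dim_user
    date_id       INT,  -- FK to dim_date
    incident_id   INT   -- carried over from the OLTP system
);

-- Drill up/down is then just GROUP BY over the dimensions that matter
SELECT d.quarter_no, t.audit_type, COUNT(*) AS events
FROM fact_audit_event f
JOIN dim_date d       ON d.date_id = f.date_id
JOIN dim_audit_type t ON t.audit_type_id = f.audit_type_id
GROUP BY d.quarter_no, t.audit_type;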
Hope this helps

Star Schema: How are the fact table aggregations performed?

https://web.stanford.edu/dept/itss/docs/oracle/10g/olap.101/b10333/globdiag.gif
Assume that we have a star schema as above.
My question is: in real time, how do we populate the unit_price and unit_cost columns of the fact table?
Can anyone provide me a star schema's tables with real data?
I am having a hard time understanding star schemas...
Please help!
A star schema consists of two types of tables: fact tables and dimensions.
The idea of the star design is that you can split your data into two parts.
The static part is described with dimensions; the dynamic part (= transactions) goes in the fact table.
Each transaction is stored in the fact table as a new record and is connected to the surrounding dimensions, which define the context of the transaction.
The example in the link contains two fact tables: SHIPMENTS and PRODUCT_CONDITIONS.
(Note that the fact tables in the link are actually dubbed UNITS_HISTORY_FACT and PRICE_AND_COST_HISTORY_FACT, but I don't find those names the best choice.)
The SHIPMENTS table stores one record for each shipment of a PRODUCT to a CUSTOMER at some TIME, via a defined CHANNEL.
All the above information is defined using the corresponding keys of the respective dimensions.
The fact table also contains MEASURES describing the attributes of the transaction, here the number of UNITS shipped.
The structure of the fact table would therefore be:
    CUSTOMER_ID
    PRODUCT_ID
    TIME_ID
    CHANNEL_ID
    UNITS
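To make the "aggregations" part of the question concrete: a query against such a fact table sums the measure while grouping by attributes of the dimensions it points to. A hedged example (channel_dim, time_dim, channel_name, and calendar_quarter are assumed names, not from the linked diagram):

SELECT ch.channel_name,
       t.calendar_quarter,
       SUM(f.units) AS total_units
FROM shipments f
JOIN channel_dim ch ON ch.channel_id = f.channel_id
JOIN time_dim t     ON t.time_id = f.time_id
GROUP BY ch.channel_name, t.calendar_quarter;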
The second fact table (bottom) is more interesting, because here you split the product into two parts:
    PRODUCT - a dimension defining the ID, name, and other more static attributes
    PRODUCT_CONDITIONS - the fact table, designed with the expectation that the price and cost of the product will change over time
With each change of the price or cost, you insert a new record in the fact table and connect it to the PRODUCT and to the TIME (of the change).
The structure of this fact table would therefore be:
    PRODUCT_ID
    TIME_ID
    UNIT_PRICE
    UNIT_COST
A final note on the design of the TIME dimension.
The best practice for connecting the fact table with the dimension tables is to use meaningless IDs (surrogate keys), but for the TIME dimension you should be careful: for big (time-partitioned) fact tables, the natural key (a DATE) is often used instead, to be able to deploy the partitioning features. See more details in How I Defined a Time Dimension Using a Surrogate Key and other resources on the web.
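A small sketch of the second fact table in action (the values are invented, and time_id is assumed to sort chronologically, as the date-formatted keys just mentioned do):

-- A price or cost change is a new row, never an update of an old one
INSERT INTO product_conditions (product_id, time_id, unit_price, unit_cost)
VALUES (42, 20130101, 19.99, 12.50);

INSERT INTO product_conditions (product_id, time_id, unit_price, unit_cost)
VALUES (42, 20130301, 21.99, 12.50);

-- The current price/cost is the latest row per product
SELECT pc.product_id, pc.unit_price, pc.unit_cost
FROM product_conditions pc
WHERE pc.time_id = (SELECT MAX(pc2.time_id)
                    FROM product_conditions pc2
                    WHERE pc2.product_id = pc.product_id);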

typed data set; parent/child select and update with ONE trip to the database (for each op)?

Is it possible, using an ADO.NET typed DataSet containing two tables in a parent/child relationship, to populate the DataSet with ONE trip to the d/b (the query could return one or two result sets; if one, then the result set has columns from both tables, right?), and to update the d/b with ONE trip (a call to a generated stored proc, I guess)?
By "is it possible", I mean is it possible to have Visual Studio (2012) automagically generate the classes and SQL code to make this happen?
Or am I kind of on my own? It's looking an awful lot like VS really wants to generate one d/b server round trip for each table involved.
I guess the update stored proc would have to take table-typed parameters for both parent and child, and perform inserts/updates/deletes appropriately.
Yes, one round trip per table is the way to go.
(It's certainly possible to use a join query to populate a DataTable, but VS will then be reluctant to generate UPDATE etc. SQL. This may or may not be a problem, depending on what you intend to do with the DataSet.)
But if you have two tables in a dataset, let's say customers and orders, then you would typically use two queries and two trips to the d/b:
SELECT * FROM customers WHERE customers.customerid = @customerid
and
SELECT * FROM orders WHERE orders.customerid = @customerid
Somewhat more counter-intuitive is the situation where you want all customers and orders for one country:
SELECT * FROM customers WHERE customers.countryid = @countryid
and
SELECT orders.* FROM orders INNER JOIN customers ON customers.customerid = orders.customerid WHERE customers.countryid = @countryid
Note how the join query returns data from only one table, but uses the join to identify which rows to return.
Then, once you have the data in your DataSet, you can navigate it using the GetParentRow and GetChildRows methods. This is how ADO.NET manages hierarchical data.
You do need this one-table-at-a-time approach because, assuming you have foreign key constraints in your d/b, you need to insert and update in the reverse order from deletes, as the sketch below shows.
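That ordering is easiest to see in plain SQL. A sketch with the customers/orders pair from above (the column lists are assumptions):

-- Deletes run child-first, so no order ever points at a missing customer
DELETE FROM orders WHERE customerid = @customerid;
DELETE FROM customers WHERE customerid = @customerid;

-- Inserts and updates run parent-first, for the same reason
INSERT INTO customers (customerid, countryid) VALUES (@customerid, @countryid);
INSERT INTO orders (orderid, customerid) VALUES (@orderid, @customerid);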
EDIT: Yes, this does mean that in some circumstances, depending on the data you want and the structure of your primary keys, you could end up with a humongous set of JOINs that still only pull data from the table at the end of the hierarchy. This might seem wrong in terms of traditional SQL, but actually it's fine. The time you lose in the multiple, more complex queries is saved by the reduced amount of data you have to pull back across the wire, compared with one big join query that would return multiple copies of the parent data.

How are applications like Twitter implemented?

Suppose A follows 100 people; then we would need 100 join statements, which I think is horrible for the database.
Or are there other ways?
Why would you need 100 joins?
You would have a simple table "Followers" with your ID and the other person's ID in it...
Then you retrieve the tweets by joining, something like this:
Select top 100
    tweet.*
from
    tweet
    inner join followers on followers.FollowerID = tweet.AuthorID
where
    followers.MasterID = @yourID
order by
    tweet.ID desc
Now you just need decent caching, and make sure you use a non-locking query, and you have all the information... (well, maybe add some user data into the mix).
Edit:
tweet
    ID - the tweet ID
    AuthorID - ID of the poster
Followers
    MasterID - (basically your ID)
    FollowerID - (ID of the person following you)
The Followers table has a composite primary key based on MasterID and FollowerID.
It should have 2 indexes: one on (MasterID, FollowerID) and one on (FollowerID, MasterID).
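(A sketch of that schema in SQL - the types and the index name are assumptions; only the tables, columns, and indexing come from the answer:)

CREATE TABLE tweet (
    ID       BIGINT NOT NULL PRIMARY KEY,  -- tweet ID
    AuthorID BIGINT NOT NULL               -- ID of the poster
);

CREATE TABLE Followers (
    MasterID   BIGINT NOT NULL,            -- basically your ID
    FollowerID BIGINT NOT NULL,            -- ID of the person following you
    PRIMARY KEY (MasterID, FollowerID)     -- composite key doubles as the first index
);

-- Second index for lookups from the other direction
CREATE INDEX IX_Followers_Reverse ON Followers (FollowerID, MasterID);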
The real trick is to minimize your database usage (e.g., cache, cache, cache) and to understand usage patterns. In the specific case of Twitter, they use a bunch of different techniques, from queuing to an insane amount of in-memory caching to some really clever data-flow optimizations. Give Scaling Twitter: Making Twitter 10000 Percent Faster and the other associated articles a read. As for how you implement "following": you denormalize the data (precalculate and maintain join tables instead of performing joins on the fly) or don't use a database at all. <-- Make sure to read this!
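To make "precalculate and maintain join tables" concrete, a hedged sketch in the same SQL dialect as above, using an invented timeline table (fan-out on write; the @ names are placeholders):

-- When @authorID posts @tweetID, copy it into the precomputed timeline
-- of every user who has @authorID in their follow list
INSERT INTO timeline (user_id, tweet_id)
SELECT f.MasterID, @tweetID
FROM Followers f
WHERE f.FollowerID = @authorID;

-- Reading a timeline is then a single indexed scan - no joins at read time
select top 100 t.tweet_id
from timeline t
where t.user_id = @yourID
order by t.tweet_id desc;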
