I have to design a data warehouse model and an ETL process for a class at my university. My data warehouse has to store opinions / comments about a product. Each record should consist of:
comment text (String)
product score ({0, 0.5, … , 4.5, 5})
comment author (String)
comment date (Date)
product recommendation ({Yes, No})
comment up votes (Int)
comment down votes (Int)
product pros (many Strings, e.g. {price, design, durability, … }) and their count
product cons (many Strings, e.g. {too loud, too heavy, price, … }) and their count
In addition, the data warehouse should store information about the product:
product category
product brand
product model
I want to create the data warehouse model first, but I have a problem with storing product pros and cons, as it is a many-to-many relationship. In a normal relational database I would simply create an associative table, but here I am not sure how to proceed; after all, I don’t want to normalize the fact table.
I am considering 3 approaches. The first is the one I presented in the diagram below: I used the bridge table method (though I don’t know if I did so correctly) to get rid of the many-to-many relationship. I don’t know how it will impact query performance.
The second approach I may use is the boolean column method. In the PROS and CONS tables I can create a column for each possible value, but there can be up to 100 different pros or cons. Also, the number of possible pros or cons is not constant over time. Authors can list new pros or cons in their comments (that’s how it works in the data source), but I can’t add new columns (I shouldn’t change data in the data warehouse).
The third approach I am considering is to keep pros in the PROS table but in 1 column, where values are separated by commas or some other delimiter, e.g. “price, design, color”. It keeps things simple, but the data is hard to analyze or slice & dice.
Which approach should I use in this situation? Which is better for loading data into the data warehouse, given that from the data source I will get all the comments and I want to load only the comments that are new since the last load?
As I understand it, your first option would be the best if it were modified a little from what you have described here.
In the image you have provided, the Pros_Bridge_Detail table is fine. The rest needs to be changed.
You can remove the Pros_Bridge table that holds just the count. You can add that column to the COMMENT fact table instead. That would be more efficient and easier to query than joining many tables.
You said you have many kinds of pros, like price, design, durability etc. Let's put those into a separate dimension.
Add a new column to your Pros_Bridge_Detail table to hold the ID of the newly created dimension that holds the product pro types (design, durability etc.).
Now, once you add a product pro, the Pros_Bridge_Detail table will hold the pros the user gave, and the ID of the new dimension identifies what each pro is.
Also, don't forget to store the comment ID in the Pros_Bridge_Detail table as well, as that will be your link (FK) to the COMMENT fact table.
The same can be done for cons as well.
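To make the layout above concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. The table and column names (pro_dim, comment_fact, pros_bridge_detail, pros_count) and the sample rows are assumptions made for illustration, not the schema from the original diagram, and SQLite merely stands in for whatever warehouse engine is used.

```python
import sqlite3

def build_schema(conn):
    # Hypothetical names: a pros dimension, the comment fact table with the
    # count moved into it, and a bridge table linking the two.
    conn.executescript("""
        CREATE TABLE pro_dim (
            pro_id   INTEGER PRIMARY KEY,
            pro_name TEXT NOT NULL UNIQUE
        );
        CREATE TABLE comment_fact (
            comment_id INTEGER PRIMARY KEY,
            score      REAL,
            pros_count INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE pros_bridge_detail (
            comment_id INTEGER NOT NULL REFERENCES comment_fact(comment_id),
            pro_id     INTEGER NOT NULL REFERENCES pro_dim(pro_id),
            PRIMARY KEY (comment_id, pro_id)
        );
    """)

def load_comment(conn, comment_id, score, pros):
    # A pro that has never been seen before becomes a new dimension row,
    # so no column has to be added when authors invent new pros.
    unique_pros = sorted(set(pros))
    conn.execute(
        "INSERT INTO comment_fact (comment_id, score, pros_count) VALUES (?, ?, ?)",
        (comment_id, score, len(unique_pros)),
    )
    for name in unique_pros:
        conn.execute("INSERT OR IGNORE INTO pro_dim (pro_name) VALUES (?)", (name,))
        pro_id = conn.execute(
            "SELECT pro_id FROM pro_dim WHERE pro_name = ?", (name,)
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO pros_bridge_detail (comment_id, pro_id) VALUES (?, ?)",
            (comment_id, pro_id),
        )

conn = sqlite3.connect(":memory:")
build_schema(conn)
load_comment(conn, 1, 4.5, ["price", "design"])
load_comment(conn, 2, 3.0, ["price"])

# Slice by pro: number of comments and average score per pro.
rows = conn.execute("""
    SELECT d.pro_name, COUNT(*), AVG(f.score)
    FROM pros_bridge_detail b
    JOIN pro_dim d ON d.pro_id = b.pro_id
    JOIN comment_fact f ON f.comment_id = b.comment_id
    GROUP BY d.pro_name
    ORDER BY d.pro_name
""").fetchall()
print(rows)
```

A cons dimension and cons bridge table would follow the same pattern.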
Hope you understand what I just explained and hope it helps. Let me know if you have any issues.
Hello I am working with Laravel,
I have to create two simple models, let's say Stores and Books.
Stores can have one or multiple Books and Books can belong to many Stores.
Of course I will use a many to many relationship, with a pivot table.
Books can then have different prices depending on the store.
I think a separate table can only complicate things. In my mind the pivot table associating books and stores should have a price column, but pivot tables only contain store_id and book_id.
Should I create a book_prices and associate it with books and to stores? What is the best approach?
You are free and able to set other attributes on your pivot table. You can read more about it in the docs.
https://laravel.com/docs/9.x/eloquent-relationships#retrieving-intermediate-table-columns
You have to define the relationship accordingly, the following should clarify how this works. In this example you use the many-to-many relationship and add the price column to every retrieved pivot model.
public function books()
{
    // Expose the extra `price` column of the pivot table on each related Book.
    return $this->belongsToMany(Book::class)
        ->withPivot('price');
}
For example, you are able to access the pivot column in a loop like this
foreach ($shop->books as $book)
{
echo $book->pivot->price;
}
You can define additional columns for your pivot table in the migration for the pivot table, and then when defining the relationship use withPivot to define the additional columns so they come through in the model:
return $this->belongsToMany(Book::class)->withPivot('price');
(Adapted from the Laravel documentation, see https://laravel.com/docs/9.x/eloquent-relationships#retrieving-intermediate-table-columns)
It depends on the complexity of your case, but yes, you have two options. Let's say that the pivot table is called book_store:
Add a price column directly to book_store. This is obviously the simpler option. The drawbacks are:
The history of the price changes isn't logged. You'll have to create another table for logging if you want to keep this history information.
Changes made to the price directly change the price on the related book_store record, meaning that the price is updated "live"; e.g. users cannot update the price now but "publish" it some time later, just like this example in the doc.
Create a new, separate table to store the price. This may seem relatively more complex, but it may also be more future-proof.
Basically, you get the 2 things that you miss in the first option above.
Don't think too much about book_store being a pivot table. One way to see it is like this: book_store IS a pivot table from books and stores tables viewpoints, but it's also just a normal SQL table which could relate to any other tables using any kind of relationships.
If you want to implement this, make sure to create a primary key in the book_store table.
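As a rough sketch of the second option, the snippet below uses Python's standard-library sqlite3 module rather than Laravel migrations. The book_store_prices table, its columns (price_cents, valid_from), the price_on helper and the sample data are all assumptions made for illustration. Each price row points at the primary key of book_store, which keeps the history and lets a future price be stored before it takes effect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE books  (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE stores (id INTEGER PRIMARY KEY, name  TEXT NOT NULL);

    -- The pivot table gets its own primary key so other tables can reference it.
    CREATE TABLE book_store (
        id       INTEGER PRIMARY KEY,
        book_id  INTEGER NOT NULL REFERENCES books(id),
        store_id INTEGER NOT NULL REFERENCES stores(id),
        UNIQUE (book_id, store_id)
    );

    -- Hypothetical price table: one row per price change.
    CREATE TABLE book_store_prices (
        id            INTEGER PRIMARY KEY,
        book_store_id INTEGER NOT NULL REFERENCES book_store(id),
        price_cents   INTEGER NOT NULL,
        valid_from    TEXT NOT NULL  -- ISO date, so string comparison orders correctly
    );
""")

def price_on(conn, book_id, store_id, day):
    """Return the price in effect on `day`, or None if no price applies yet."""
    row = conn.execute("""
        SELECT p.price_cents
        FROM book_store_prices p
        JOIN book_store bs ON bs.id = p.book_store_id
        WHERE bs.book_id = ? AND bs.store_id = ? AND p.valid_from <= ?
        ORDER BY p.valid_from DESC
        LIMIT 1
    """, (book_id, store_id, day)).fetchone()
    return row[0] if row else None

# Sample data (made up for the example).
conn.execute("INSERT INTO books VALUES (1, 'Sample Book')")
conn.execute("INSERT INTO stores VALUES (1, 'Sample Store')")
conn.execute("INSERT INTO book_store VALUES (1, 1, 1)")
conn.execute("INSERT INTO book_store_prices VALUES (1, 1, 1000, '2024-01-01')")
# A price entered ahead of time; it only applies from June onwards.
conn.execute("INSERT INTO book_store_prices VALUES (2, 1, 1200, '2024-06-01')")

print(price_on(conn, 1, 1, "2024-03-15"))
print(price_on(conn, 1, 1, "2024-07-01"))
```

With the first option (a price column on book_store) none of this is needed, at the cost of losing the history and the ability to schedule a change.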
In the end, it all depends on what you need. Feel free to ask if you need more insight about this. I hope this helps.
Which tables are involved in the reindexing process in Magento?
Please share any documentation available on this.
I can't take credit for this, as it is taken from the original post: Can someone explain Magentos Indexing feature in detail?
Magento's indexing is only similar to database-level indexing in spirit. As Anton states, it is a process of denormalization to allow faster operation of a site. Let me try to explain some of the thoughts behind the Magento database structure and why it makes indexing necessary to operate at speed.
In a more "typical" MySQL database, a table for storing catalog products would be structured something like this:
PRODUCT:
product_id INT
sku VARCHAR
name VARCHAR
size VARCHAR
longdesc VARCHAR
shortdesc VARCHAR
... etc ...
This is fast for retrieval, but it leaves a fundamental problem for a piece of eCommerce software: what do you do when you want to add more attributes? What if you sell toys, and rather than a size column, you need age_range? Well, you could add another column, but it should be clear that in a large store (think Walmart, for instance), this would result in rows that are 90% empty, and attempting to maintain new attributes is nigh impossible.
To combat this problem, Magento splits tables into smaller units. I don't want to recreate the entire EAV system in this answer, so please accept this simplified model:
PRODUCT:
product_id INT
sku VARCHAR
PRODUCT_ATTRIBUTE_VALUES:
product_id INT
attribute_id INT
value MISC
PRODUCT_ATTRIBUTES:
attribute_id
name
Now it's possible to add attributes at will by entering new values into product_attributes and then putting adjoining records into product_attribute_values. This is basically what Magento does (with a little more respect for datatypes than I've displayed here). In fact, now there's no reason for two products to have identical fields at all, so we can create entire product types with different sets of attributes!
However, this flexibility comes at a cost. If I want to find the color of a shirt in my system (a trivial example), I need to find:
The product_id of the item (in the product table)
The attribute_id for color (in the attribute table)
Finally, the actual value (in the attribute_values table)
Magento used to work like this, but it was dead slow. So, to allow better performance, they made a compromise: once the shop owner has defined the attributes they want, go ahead and generate the big table from the beginning. When something changes, nuke it from space and generate it over again. That way, data is stored primarily in our nice flexible format, but queried from a single table.
These resulting lookup tables are the Magento "indexes". When you re-index, you are blowing up the old table and generating it again.
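The "nuke it and generate it again" idea can be sketched against the simplified model above. This is not Magento's actual code or schema: the product_flat table, the reindex and quote_ident helpers and the sample rows are assumptions for illustration, written in Python with the standard-library sqlite3 module.

```python
import sqlite3

def quote_ident(name):
    """Quote an attribute name so it can be used as a column name."""
    return '"' + name.replace('"', '""') + '"'

def reindex(conn):
    """Drop the flat lookup table and rebuild it from the EAV tables."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM product_attributes ORDER BY attribute_id")]
    columns = ["product_id INTEGER PRIMARY KEY", "sku TEXT"]
    columns += [quote_ident(n) + " TEXT" for n in names]
    conn.execute("DROP TABLE IF EXISTS product_flat")
    conn.execute("CREATE TABLE product_flat (" + ", ".join(columns) + ")")
    # One row per product, then fill in each attribute value that exists.
    conn.execute(
        "INSERT INTO product_flat (product_id, sku) SELECT product_id, sku FROM product")
    values = conn.execute("""
        SELECT v.product_id, a.name, v.value
        FROM product_attribute_values v
        JOIN product_attributes a ON a.attribute_id = v.attribute_id
    """).fetchall()
    for product_id, name, value in values:
        conn.execute(
            "UPDATE product_flat SET " + quote_ident(name) + " = ? WHERE product_id = ?",
            (value, product_id))

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, sku TEXT);
    CREATE TABLE product_attributes (attribute_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE product_attribute_values (
        product_id INTEGER, attribute_id INTEGER, value TEXT,
        PRIMARY KEY (product_id, attribute_id));
    INSERT INTO product VALUES (1, 'SHIRT-1'), (2, 'TOY-1');
    INSERT INTO product_attributes VALUES (1, 'color'), (2, 'age_range');
    INSERT INTO product_attribute_values VALUES (1, 1, 'red'), (2, 2, '3-5');
""")
reindex(conn)

# Reads now hit a single table instead of joining three.
flat_rows = conn.execute(
    "SELECT sku, color, age_range FROM product_flat ORDER BY product_id").fetchall()
print(flat_rows)
```

Adding a new attribute only touches the EAV tables; the new column appears in product_flat after the next call to reindex.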
Hello everybody, I'm making a "bulletin board" like this: http://stena.kg/ad/post. I'm using Laravel 5.0 and don't know how to store different fields in a database table. For example, if I choose the "Cars" category I should fill in Mark, Model, Fuel (and other fields for the cars category); if I choose the Flats category I should fill in fields like Area, Number of rooms etc. How do I organize all of this? I tried some ideas but nothing helped me(
You could try to save the data as JSON in the table: serialize the fields to a JSON string and save it in the DB. But it will cause many problems in the future, so I don't recommend that solution. I recommend storing the data in separate tables, one for each category. To streamline this, it is possible to create a category table and a category_item table with common fields like name, description and so on. Different categories demand specific fields, so the best solution is to create a table per category.
I have a use case where I need to model reference data for e.g. different flavors of ice cream. Say I have 50 flavors of ice cream :-
20 attributes e.g. freezing-temp, creaminess will be shared across all flavors
every flavor of ice cream would have 20-30 attributes that will not be shared with other flavors e.g. :-
Strawberry ice cream might track tartness, fruit percentage etc.
Chocolate ice cream might track bitterness, cocoa level etc.
How would I model this data neatly in a database model, purely from a storage / retrieval point of view?
The options I can think of :-
One table per flavor. This will need 50 tables, and each table will have 20 columns that will overlap with each other, and another 20-30 attributes that will be unique to the flavor.
Pros : models the data of each flavor quite well
Cons : column overlap and large number of tables needed
One table for all flavors. This will only need 1 table, but will require 1000+ columns most of which would be empty.
Pros : models the data of ice cream in general, quite well
Cons : large number of columns and large amount of 'wasted' space
One key-value table for all flavors, with flavor Id, attribute name and attribute value.
Pros : simplest to create and insert data
Cons : harder to extract, not really a data model per se, difficult to form constraints for attributes, or for attributes related to other attributes
Never store a value in the wrong type.
Whatever design you choose, make sure that values are stored in their natural format. Use NUMBER, DATE, VARCHAR2, CLOB, XMLTYPE, CLOB (IS JSON), TIMESTAMP, etc. Trying to cram everything in a string will cause many problems. You lose validation, convenience, performance, and type safety.
For example, here is a common type safety problem. Imagine this simple query to find ice cream that is more than 25% fruit:
select *
from ice_cream_flavor_attribute
where attribute_name = 'Fruit Percentage'
and attribute_value > 25;
Do you see the bug? Do you see how the same query, with the same data, may work one day and fail the next with ORA-01722: invalid number?
It's difficult to write a query that forces Oracle to evaluate conditions in a specific order. Re-ordering the predicates won't help (99.9% of the time). Adding an inline view won't help (99.9% of the time). Using a CASE statement will work but not 100% of the time. Using hints will work but is tricky. Using an inline view and a ROWNUM is my preferred way of solving the problem but it looks odd and is difficult to understand.
If you must use an Entity Attribute Value model (and if you have more than 1000 attributes it may be unavoidable), at least use the right types.
Don't worry about space - a null column uses at most 1 byte.
Don't worry about complaints like "but then our queries are more complicated, we always need to know which column to use!" - realistically there is almost nothing useful you can do with a value without knowing its type. Every time you read or write a value you must already be thinking about the type.
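As an illustration of "use the right types" in an EAV table, here is a minimal sketch in Python with the standard-library sqlite3 module. The column names (number_value, string_value, date_value) and the sample data are assumptions. SQLite is dynamically typed, so CHECK constraints stand in for what Oracle's NUMBER, VARCHAR2 and DATE columns would enforce on their own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ice_cream_flavor_attribute (
        flavor         TEXT NOT NULL,
        attribute_name TEXT NOT NULL,
        -- One column per type; exactly one of them is filled in per row.
        number_value   REAL CHECK (number_value IS NULL
                                   OR typeof(number_value) IN ('integer', 'real')),
        string_value   TEXT CHECK (string_value IS NULL
                                   OR typeof(string_value) = 'text'),
        date_value     TEXT CHECK (date_value IS NULL
                                   OR date(date_value) IS NOT NULL),
        CHECK ((number_value IS NOT NULL)
             + (string_value IS NOT NULL)
             + (date_value IS NOT NULL) = 1),
        PRIMARY KEY (flavor, attribute_name)
    )
""")

rows = [
    ("Strawberry", "Fruit Percentage", 30.0, None, None),
    ("Raspberry", "Fruit Percentage", 20.0, None, None),
    ("Strawberry", "Origin", None, "local", None),
]
conn.executemany(
    "INSERT INTO ice_cream_flavor_attribute VALUES (?, ?, ?, ?, ?)", rows)

# The comparison is numeric because the column only ever holds numbers.
fruity = conn.execute("""
    SELECT flavor
    FROM ice_cream_flavor_attribute
    WHERE attribute_name = 'Fruit Percentage'
      AND number_value > 25
""").fetchall()
print(fruity)

# A non-numeric value in the numeric column is rejected at write time
# instead of breaking a query later.
try:
    conn.execute(
        "INSERT INTO ice_cream_flavor_attribute VALUES (?, ?, ?, ?, ?)",
        ("Mango", "Fruit Percentage", "lots", None, None))
    bad_insert_rejected = False
except sqlite3.IntegrityError:
    bad_insert_rejected = True
print(bad_insert_rejected)
```

The failure moves from read time to write time, which is the point of storing each value in a column of its natural type.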
I'd have one table with all the common attributes, then another for the non-shared attributes. For example:
CREATE TABLE ICE_CREAM_FLAVOR
(FLAVOR VARCHAR2(100) PRIMARY KEY,
FREEZING_TEMP NUMBER,
CREAMINESS NUMBER,
ETC VARCHAR2(25),
BLAH NUMBER);
CREATE TABLE ICE_CREAM_FLAVOR_ATTRIBUTE
(ID_ICF_ATTRIBUTE NUMBER, -- should be populated by an insert trigger
FLAVOR VARCHAR2(100)
NOT NULL
REFERENCES ICE_CREAM_FLAVOR(FLAVOR),
ATTRIBUTE_NAME VARCHAR2(25),
ATTRIBUTE_VALUE VARCHAR2(100));
Your mileage may vary.
Share and enjoy.
I would suggest creating 3 different tables.
Ice Cream Flavor: stores all the flavors of ice cream. This will be the icecream_flavor_master table. If you have 50 flavors, it will have 50 rows, like Strawberry, Chocolate etc.
Ice Cream Attributes: stores all the attributes of ice cream. This will be the icecream_attribute_master table. If you have 50 attributes, it will have 50 rows, like tartness, bitterness, fruit percentage, cocoa level etc.
Ice Cream Flavor Attributes: stores the primary keys of icecream_flavor_master and icecream_attribute_master, to relate flavors to attributes.
Let me know if you need further information.
You might be able to group flavors into classes of flavors, ones that share certain attributes. This lends itself to classes and subclasses that extend other classes.
If you want to do ER modeling on this, look up "generalization/specialization" on the web. Some websites will call this a feature of "Extended ER modeling" or EER.
If you want to design relational tables to implement the ER design, look into two patterns: Single Table Inheritance and Class Table Inheritance.
https://stackoverflow.com/tags/single-table-inheritance/info
https://stackoverflow.com/tags/class-table-inheritance/info
Also, look into Martin Fowler's treatment on this subject on the web, or in one of his textbooks.
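For a feel of what Class Table Inheritance looks like here, the sketch below uses Python's standard-library sqlite3 module. The grouping into a "fruit" class and a "chocolate" class, the table names and the sample values are assumptions for illustration. Shared attributes live in the superclass table, and each subclass table holds only its own attributes, keyed by the same primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- Superclass table: attributes shared by every flavor.
    CREATE TABLE ice_cream_flavor (
        flavor        TEXT PRIMARY KEY,
        freezing_temp REAL NOT NULL,
        creaminess    REAL NOT NULL
    );
    -- Subclass tables: one per class of flavor, sharing the parent's key.
    CREATE TABLE fruit_flavor (
        flavor           TEXT PRIMARY KEY REFERENCES ice_cream_flavor(flavor),
        tartness         REAL NOT NULL,
        fruit_percentage REAL NOT NULL
    );
    CREATE TABLE chocolate_flavor (
        flavor      TEXT PRIMARY KEY REFERENCES ice_cream_flavor(flavor),
        bitterness  REAL NOT NULL,
        cocoa_level REAL NOT NULL
    );
""")

# Sample values are made up for the example.
conn.execute("INSERT INTO ice_cream_flavor VALUES ('Strawberry', -12.0, 0.7)")
conn.execute("INSERT INTO fruit_flavor VALUES ('Strawberry', 0.4, 30.0)")
conn.execute("INSERT INTO ice_cream_flavor VALUES ('Chocolate', -11.0, 0.9)")
conn.execute("INSERT INTO chocolate_flavor VALUES ('Chocolate', 0.6, 55.0)")

# Reading a subclass means joining it to the superclass on the shared key.
fruit_rows = conn.execute("""
    SELECT b.flavor, b.creaminess, f.fruit_percentage
    FROM ice_cream_flavor b
    JOIN fruit_flavor f ON f.flavor = b.flavor
    WHERE f.fruit_percentage > 25
""").fetchall()
print(fruit_rows)
```

Single Table Inheritance would instead fold the subclass columns into ice_cream_flavor as nullable columns, usually with a column recording which class each row belongs to.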
Here is what big vendors do for huge data sets in ECM (enterprise content management), where you have a quite similar scenario (many custom classes with custom attributes, some of which may be shared, and attributes of various types):
One key-value table for all flavors, with flavor Id, attribute name and attribute value.
They use one key-value table per type (string, number, date etc.).
For performance optimization, they allow dedicated tables to be defined for attributes, in order to keep each index small and not crowded with other attributes.
Dedicated tables make sense for:
Massive usage (having many rows)
Bad histograms (like flags)
Otherwise the Oracle index could be tricked and a full table scan would become the fastest access, which would be really bad.
So think early about performance when you have a huge amount of data.
My question may seem rather general, but the only answer I have got so far is from SO itself. I have a table of customer information with 47 fields, some of which are optional. I would like to split that table into two: customer_info and customer_additional_info. One of its columns stores a file in byte format. Is there any advantage to splitting the table? I saw that the JOIN will slow down query execution. Can I have more pros and cons of splitting a table into two?
I don't see much advantage in splitting the table unless some of the columns are very infrequently accessed and fairly large. There's a theoretical advantage to keeping rows small as you're going to get more of them in a cached block, and you improve the efficiency of a full table scan and of the buffer cache. Based on that I'd be wary of storing this file column in the customer table if it was more than a very small size.
Other than that, I'd keep it in a single table.
I can think of only 2 arguments in favor of splitting the table:
If all the columns in customer_additional_info are related, you could potentially get the benefit of additional declarative data integrity that you couldn't get with a single table. For instance, let's say your additional table was CustomerAddress. Your business logic may dictate that a customer address is optional, but once you have a customer Zip code, the addressL1, City and State become required fields. You could set these columns to non null if they exist in a customerAddress table. You couldn't do that if they existed directly in the customer table.
If you were doing some object-relational mapping and you had a customer class with many subclasses and you didn't want to use Single Table Inheritance. Sometimes STI creates problems when you have similar properties of various subclasses that require different storage layouts. Since all subclasses have to use the same table, you might have name clashes. The alternative is Class Table Inheritance, where you have a table for the superclass and an additional table for each subclass. This is a similar scenario to the one you described in your question.
As for cons, the join makes things harder and slower. You also run the risk of accidentally creating a 1-to-many relationship, i.e. you create 2 addresses in the CustomerAddress table and now you don't know which one is valid.
EDIT:
Let me explain the declarative ref integrity point further.
If your business rules are such that a customer address is optional, and you embed addressL1, addressL2, City, State, and Zip in your customer table, you would need to make each of these fields Nullable. That would allow someone to insert a customer with a City but no state. You could write a table level check constraint to cover this situation. But that isn't as easy as simply setting the AddressL1, City, State and Zip columns in the CustomerAddress table not nullable. To be clear, I am NOT advocating using the multi-table approach. However you asked for Pros and Cons, and I'm just pointing out this aspect falls on the pro side of the ledger.
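The two ways of enforcing "all address fields or none" can be compared side by side. The sketch below uses Python's standard-library sqlite3 module; the table names (customer_single, customer, customer_address), the rejected helper and the sample rows are assumptions for illustration. Making customer_id the primary key of the address table also guards against the accidental 1-to-many problem mentioned among the cons.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- Single table: columns must be nullable, so a table-level CHECK is
    -- needed to enforce all-or-nothing.
    CREATE TABLE customer_single (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        address_l1 TEXT, city TEXT, state TEXT, zip TEXT,
        CHECK (
            (address_l1 IS NULL AND city IS NULL AND state IS NULL AND zip IS NULL)
            OR (address_l1 IS NOT NULL AND city IS NOT NULL
                AND state IS NOT NULL AND zip IS NOT NULL)
        )
    );

    -- Split tables: the address row is optional, its columns are NOT NULL.
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE customer_address (
        customer_id INTEGER PRIMARY KEY REFERENCES customer(id),  -- at most one address
        address_l1 TEXT NOT NULL,
        address_l2 TEXT,
        city  TEXT NOT NULL,
        state TEXT NOT NULL,
        zip   TEXT NOT NULL
    );
""")

def rejected(conn, sql, params):
    """Return True if the statement violates a constraint."""
    try:
        conn.execute(sql, params)
    except sqlite3.IntegrityError:
        return True
    return False

# City without state: refused by the CHECK in the single-table design.
single_partial = rejected(
    conn, "INSERT INTO customer_single (id, name, city) VALUES (?, ?, ?)",
    (1, "Ann", "Austin"))

conn.execute("INSERT INTO customer VALUES (1, 'Ann')")
# Missing state: refused by NOT NULL in the split design.
split_partial = rejected(
    conn,
    "INSERT INTO customer_address (customer_id, address_l1, city, zip) VALUES (?, ?, ?, ?)",
    (1, "1 Main St", "Austin", "78701"))
full_address = ("1 Main St", "Austin", "TX", "78701")
insert_address = ("INSERT INTO customer_address (customer_id, address_l1, city, state, zip) "
                  "VALUES (1, ?, ?, ?, ?)")
first_address = rejected(conn, insert_address, full_address)
second_address = rejected(conn, insert_address, full_address)
print(single_partial, split_partial, first_address, second_address)
```

Both designs reject the partial address; the difference is only in how much constraint logic has to be written by hand.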
I second what David Aldridge said, I'd just like to add a point about the file column (presumably BLOB)...
BLOBs are stored up to approx. 4000 bytes in-line [1]. If a BLOB is used rarely, you can specify DISABLE STORAGE IN ROW to store it out-of-line, removing the "cache pollution" without the need to split the table.
But whatever you do, measure the effects on realistic amounts of data before you make the final decision.
[1] That is, in the row itself.