Which structure is better for my cms platform - model-view-controller

I am designing our cms platform. I meet the question about how combine response data. I know two way solution1 and solution2. Which is better? Why? Anyone have better way.
Solution1:
Solution2:

You will have to share the context with us to a definite answer but will the information available I can give you the following.
Solution 1 is tightly coupled entities are in the long run can be a problem but it will be easier to maintain if you have coupled entities in the correct manner.
Solution 2 has a better separation of concerns but you will have to clearly identify entities and will have multiple files to maintain.
Usually a single entity represents a single table in the table with the constraints of the table.

Related

Kedro Data Modelling

We are struggling to model our data correctly for use in Kedro - we are using the recommended Raw\Int\Prm\Ft\Mst model but are struggling with some of the concepts....e.g.
When is a dataset a feature rather than a primary dataset? The distinction seems vague...
Is it OK for a primary dataset to consume data from another primary dataset?
Is it good practice to build a feature dataset from the INT layer? or should it always pass through Primary?
I appreciate there are no hard & fast rules with data modelling but these are big modelling decisions & any guidance or best practice on Kedro modelling would be really helpful, I can find just one table defining the layers in the Kedro docs
If anyone can offer any further advice or blogs\docs talking about Kedro Data Modelling that would be awesome!
Great question. As you say, there are no hard and fast rules here and opinions do vary, but let me share my perspective as a QB data scientist and kedro maintainer who has used the layering convention you referred to several times.
For a start, let me emphasise that there's absolutely no reason to stick to the data engineering convention suggested by kedro if it's not suitable for your needs. 99% of users don't change the folder structure in data. This is not because the kedro default is the right structure for them but because they just don't think of changing it. You should absolutely add/remove/rename layers to suit yourself. The most important thing is to choose a set of layers (or even a non-layered structure) that works for your project rather than trying to shoehorn your datasets to fit the kedro default suggestion.
Now, assuming you are following kedro's suggested structure - onto your questions:
When is a dataset a feature rather than a primary dataset? The distinction seems vague...
In the case of simple features, a feature dataset can be very similar to a primary one. The distinction is maybe clearest if you think about more complex features, e.g. formed by aggregating over time windows. A primary dataset would have a column that gives a cleaned version of the original data, but without doing any complex calculations on it, just simple transformations. Say the raw data is the colour of all cars driving past your house over a week. By the time the data is in primary, it will be clean (e.g. correcting "rde" to "red", maybe mapping "crimson" and "red" to the same colour). Between primary and the feature layer, we will have done some less trivial calculations on it, e.g. to find one-hot encoded most common car colour each day.
Is it OK for a primary dataset to consume data from another primary dataset?
In my opinion, yes. This might be necessary if you want to join multiple primary tables together. In general if you are building complex pipelines it will become very difficult if you don't allow this. e.g. in the feature layer I might want to form a dataset containing composite_feature = feature_1 * feature_2 from the two inputs feature_1 and feature_2. There's no way of doing this without having multiple sub-layers within the feature layer.
However, something that is generally worth avoiding is a node that consumes data from many different layers. e.g. a node that takes in one dataset from the feature layer and one from the intermediate layer. This seems a bit strange (why has the latter dataset not passed through the feature layer?).
Is it good practice to build a feature dataset from the INT layer? or should it always pass through Primary?
Building features from the intermediate layer isn't unheard of, but it seems a bit weird. The primary layer is typically an important one which forms the basis for all feature engineering. If your data is in a shape that you can build features then that means it's probably primary layer already. In this case, maybe you don't need an intermediate layer.
The above points might be summarised by the following rules (which should no doubt be broken when required):
The input datasets for a node in layer L should all be in the same layer, which can be either L or L-1
The output datasets for a node in layer L should all be in the same layer L, which can be either L or L+1
If anyone can offer any further advice or blogs\docs talking about Kedro Data Modelling that would be awesome!
I'm also interested in seeing what others think here! One possibly useful thing to note is that kedro was inspired by cookiecutter data science, and the kedro layer structure is an extended version of what's suggested there. Maybe other projects have taken this directory structure and adapted it in different ways.
Your question prompted us to write a Medium article better explaining these concepts, it's just been published on Toward Data Science

What algorithms could i implement to improve the general design and performance of a database?

I'm working on a project for university. Do you guys know what kind of algorithms i could implement that would help with the proper design and general performance of a database? Until now i came up with an algorithm that can help the user pick candidate keys and also an algorithm for normalization up to 3NF. Do you have any other ideas or suggestions? Thanks.
This is like asking how you can figure out how to make a car be more efficient. It's such a broad question that it's essentially unanswerable. There are so many moving parts to a car, and each one has its own problems. You really need to understand what each component is doing. In the case of databases, you need to understand the data before you try and fix it. And if you want a good answer, you have to ask the right questions.
A good question should include context on what you are working with, and what you are trying to do. And when it comes to data manipulation, the details are extremely important. How is your data represented? What kind of infrastructure are you working with? What purpose does the data serve, and what processes use this data? If you are working with floating point numbers, are your processes tolerant of small rounding errors? Would your organization even let you make changes to how the data is stored?
In general, adding algorithms to improve data performance is probably largely unnecessary. Databases are designed out of the box to be simple and efficient. If there were a known method to increase efficiency in general without any drawbacks, there's no reason why the designers of the system wouldn't have implemented it already.
I am just putting an answer because I have no way to tell this in the comment section. You need to understand a basic principle in database design and data model construction. What your database is for ? That is the main question, and believe it or not, sometimes people with experience make the same mistake.
As you were saying, 3NF could be good for OLTP systems, but it would be horrendous for Data Warehouse or Reporting Databases where the queries are huge and they work on big batch operations. In those systems denormalization offers always better results.
Once you know what you're database is for, then you can start to apply some "Best Practices" , but even here there is a lot of room for interpretation, and even worse, same principles could be good in one place but very bad in another. I am just going to provide you an example of my real experience
8 years ago I started a project and we have to design a database for a financial application. After some analysis, we decided to use a start model, or dimension-fact model. We decided to create indexes ( including bitmap ) for some tables, even though we were rebuild them during batch to avoid performance degradation.
Funny thing is that after some months, I realised that the indexes were useless, as the users were running queries that were accessing the whole data, mostly analytics and aggregation. Consequence: I drop all indexes.
Is it a good thing to do ? No, it is not, but in my scenario it was the best thing and the performance increased a lot, both in batch and also in user experience.
Summary, like an old friend of mine that that was working in Oracle Support used to tell me: "Performance is an art my friend, not a science"
There are too many database algorithms to list, but below is a structured way of thinking about classes of algorithms that affect database performance.
Algorithm analysis is a helpful way of categorizing and thinking about many database performance problems. While most performance problems are solved with best practices and trial-and-error, we'll never truly understand why one solution is better than another without understanding the algorithms behind them. Below is a list of functions that describe the algorithmic complexity of different database operations, ordered from fastest to slowest.
O(1/N) – Batching to reduce overhead for bulk collect, sequences, fetching rows
O(1) - Hashing for hash partitioning, hash clusters, hash joins
O(LOG(N)) – Index access for b-trees
1/((1-P)+P/N) – Amdahl's Law and its implications for parallelizing large data warehouse workloads
O(N) - Full table scans, hash joins (in theory)
O(N*LOG(N)) – Full table scan versus repeated index reads, sorting, global versus local indexes, gathering statistics (distinct approximations and partition birthday problems)
O(N^2) – Cross joins, nested loops, parsing
O(N!) – Join order
O(∞) – The optimizer (satisficing and avoiding the halting problem)
One suggestion - based on the way you phrased your questions and comments, you're thinking of a database as merely a place to store data. But the most interesting parts of a database happen when you think of them as joining machines. There's not much to optimize about data sitting around, the real work happens when data is combined.
The above list is based on Chapter 16 of my book, Pro Oracle SQL Development. You can read an early version of the entire chapter for free here. While the chapter mostly stands alone, it requires an advanced understanding of Oracle. But each of the topics could be the basis for a lifetime of academic study, so you only need to pick one.

Algorithm to organize table into many tables to have less cells?

I'm not really trying to compress a database. This is more of a logical problem. Is there any algorithm that will take a data table with lots of columns and repeated data and find a way to organize it into many tables with ID's in such a way that in total there are as few cells as possible, and that this tables can be then joined with a query to replicate the original one.
I don't care about any particular database engine or language. I just want to see if there is a logical way of doing it. If you will post code, I like C# and SQL but you can use any.
I don't know of any automated algorithms but what you really need to do is heavily normalize your database. This means looking at your actual functional dependencies and breaking this off wherever it makes sense.
The problem with trying to do this in a computer program is that it isn't always clear if your current set of stored data represents all possible problem cases. You can't only look at numbers of values either. It makes little sense to break off booleans into their own table because they have only two values, for example, and this is only the tip of the iceberg.
I think that at this point, nothing is going to beat good ol' patient, hand-crafted normalization. This is something to do by hand. Any possible computer algorithm will either make a total mess of things or make you define the relationships such that you might as well do it all yourself.

Pitfalls in prototype database design (for performance viability testing)

Following on from my previous question, I'm looking to run some performance tests on various potential schema representations of an object model. However, the catch is that while the model is conceptually complete, it's not actually finalised yet - and so the exact number of tables, and numbers/types of attributes in each table aren't definite.
From my (possibly naive) perspective it seems like it should be possible to put together a representative prototype model for each approach, and test the performance of each of these to determine which is the fastest approach for each case.
And that's where the question comes in. I'm aware that the performance characteristics of databases can be very non-intuitive, such that a small (even "trivial") change can lead to an order of magnitude difference. Thus I'm wondering what common pitfalls there might be when setting up a dummy table structure and populating it with dummy data. Since the environment is likely to make a massive difference here, the target is Oracle 10.2.0.3.0 running on RHEL 3.
(In particular, I'm looking for examples such as "make sure that one of your tables has a much more selective index than the other"; "make sure you have more than x rows/columns because below this you won't hit page faults and the performance will be different"; "ensure you test with the DATETIME datatype if you're going to use it because it will change the query plan greatly", and so on. I tried Google, expecting there would be lots of pages/blog posts on best practices in this area, but couldn't find the trees for the wood (lots of pages about tuning performance of an existing DB instead).)
As a note, I'm willing to accept an answer along the lines of "it's not feasible to perform a test like this with any degree of confidence in the transitivity of the result", if that is indeed the case.
There are a few things that you can do to position yourself to meet performance objectives. I think they happen in this order:
be aware of architectures, best practices and patterns
be aware of how the database works
spot-test performance to get additional precision or determine impact of wacky design areas
More on each:
Architectures, best practices and patterns: one of the most common reasons for reporting databases to fail to perform is that those who build them are completely unfamiliar with the reporting domain. They may be experts on the transactional database domain - but the techniques from that domain do not translate to the warehouse/reporting domain. So, you need to know your domain well - and if you do you'll be able to quickly identify an appropriate approach that will work almost always - and that you can tweak from there.
How the database works: you need to understand in general what options the optimizer/planner has for your queries. What's the impact to different statements of adding indexes? What's the impact of indexing a 256 byte varchar? Will reporting queries even use your indexes? etc
Now that you've got the right approach, and generally understand how 90% of your model will perform - you're often done forecasting performance with most small to medium size databases. If you've got a huge one, there's a ton at stake, you've got to get more precise (might need to order more hardware), or have a few wacky spots in the design - then focus your tests on just this. Generate reasonable test data - and (important) stats that you'd see in production. And look to see what the database will do with that data. Unless you've got real data and real prod-sized servers you'll still have to extrapolate - but you should at least be able to get reasonably close.
Running performance tests against various putative implementation of a conceptual model is not naive so much as heroically forward thinking. Alas I suspect it will be a waste of your time.
Let's take one example: data. Presumably you are intending to generate random data to populate your tables. That might give you some feeling for how well a query might perform with large volumes. But often performance problems are a product of skew in the data; a random set of data will give you an averaged distribution of values.
Another example: code. Most performance problems are due to badly written SQL, especially inappropriate joins. You might be able to apply an index to tune an individual for SELECT * FROM my_table WHERE blah but that isn't going to help you forestall badly written queries.
The truism about premature optimization applies to databases as well as algorithms. The most important thing is to get the data model complete and correct. If you manage that you are already ahead of the game.
edit
Having read the question which you linked to I more clearly understand where you are coming from. I have a little experience of this Hibernate mapping problem from the database designer perspective. Taking the example you give at the end of the page ...
Animal > Vertebrate > Mammal > Carnivore > Canine > Dog type hierarchy,
... the key thing is to instantiate objects as far down the chain as possible. Instantiating a column of Animals will perform much slower than instantiating separate collections of Dogs, Cats, etc. (presuming you have tables for all or some of those sub-types).
This is more of an application design issue than a database one. What will make a difference is whether you only build tables at the concrete level (CATS, DOGS) or whether you replicate the hierarchy in tables (ANIMALS, VERTEBRATES, etc). Unfortunately there are no simple answers here. For instance, you have to consider not just the performance of data retrieval but also how Hibernate will handle inserts and updates: a design which performs well for queries might be a real nightmare when it comes to persisting data. Also relational integrity has an impact: if you have some entity which applies to all Mammals, it is comforting to be able to enforce a foreign key against a MAMMALS table.
Performance problems with databases do not scale linearly with data volume. A database with a million rows in it might show one hotspot, while a similar database with a billion rows in it might reveal an entirely different hotspot. Beware of tests conducted with sample data.
You need good sound database design practices in order to keep your design simple and sound. Worry about whether your database meets the data requirements, and whether your model is relevant, complete, correct and relational (provided you're building a relational database) before you even start worrying about speed.
Then, once you've got something that's simple, sound, and correct, start worrying about speed. You'd be amazed at how much you can speed things up by just tweaking the physical features of your database, without changing any app code. To do this, you need to learn a lot about your particular DBMS.
They never said database development would be easy. They just said it would be this much fun!

If you were organizing books in a library, how would you store them and what data structure would you use?"

I would use a hash table and use ISBN number as key. As this will give me a look up time of O(1)....as avg time of look up in hash table is O(1)....
we can also use Binary search tree.....look up time is O(nlogn)...
What data structure would you guys use and why?
This sounds like a homework or interview question. If I were asking it, I would be interested in more than just whether you understand a couple of data structures. I would also want to know how you analyze a real-world problem and translate it to the world of computers and data structures.
As such, you should probably think about what operations you need to perform on the data before you pick a data structure. You should also think some about real libraries and some of the "gotchas" that could come up with any data structure you chose.
If all you need to do is translate from an ISBN to the catalog entry for the corresponding book, then a hash table might be a reasonable choice. But you might want to think about how you would deal with popular books, such as best sellers, that a library could have many copies of.
But is ISBN lookup really the important use case? I use my local library all the time, and I never look up books by ISBN. Some of things that I do are:
Look up a specific book by title. Sometimes there are different books with the same title.
Browse the list of books by an author I like
Find where books on a particular subject are shelved, so I can browse them.
Librarians probably have additional uses for a catalog system:
Add new books to the catalog
Mark books as checked out
Change listing information, such as subject classification, for a book
So I guess my recommendation would be to think more carefully about what problem you want to solve before you decide on the solution.
Apologies for asking more questions instead of providing an answer. I hope this is helpful anyway.
Well ... I don't think the hardest problem to solve with designing a data structure to store information about books is that of look-up speed.
And I would certainly not settle for a system that only allowed searching if you know the ISBN. What if you only remember the author, or a few words from the title? If there is to be any gains in having a computerized system for this, you must support flexible searches, in my opinion.
I would probably look into using Dublin Core, but I'm not at all sure that's the "right" thing to do. It seems people have spent a great deal of time thinking about that one, though.

Resources