Which data structure (full HABTM ?) - data-structures

I just want to know which of these two structures gives better performance:
Full HABTM (many-to-many), with a very big join table (so only one association per query).
Or HABTM + 1 hasOne, which significantly reduces the join table rows (to the number of main entities, approximately 50,000 rows). But this method forces me to query through 2 associations.
To sum up: should I use a query with a single association but a big join table (120,000 rows), or a query with 2 associations but a more lightweight join table?

After some reflection, I think I will use a single join.
I have just seen some benchmarks showing that a single join is more efficient than multiple queries, even if the join table contains a lot of rows. It also means less development effort for my application code.
I'm open to any advice.
Thanks
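As a rough sketch of the single-association option, here is a join table queried with one join, using Python's built-in sqlite3 module. The table names (posts, tags, posts_tags) and the data are hypothetical, not taken from the question. The assumption being illustrated is that with a composite primary key plus a reverse index, each lookup is an index seek, so the row count of the join table should matter less than whether the lookup is indexed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE tags  (id INTEGER PRIMARY KEY, name  TEXT NOT NULL);
-- HABTM join table: one row per (post, tag) pair.
CREATE TABLE posts_tags (
    post_id INTEGER NOT NULL REFERENCES posts(id),
    tag_id  INTEGER NOT NULL REFERENCES tags(id),
    PRIMARY KEY (post_id, tag_id)
);
-- Reverse index so lookups that start from a tag are indexed too.
CREATE INDEX idx_posts_tags_tag ON posts_tags (tag_id, post_id);
""")
conn.executemany("INSERT INTO posts VALUES (?, ?)", [(1, "first"), (2, "second")])
conn.executemany("INSERT INTO tags VALUES (?, ?)", [(1, "sql"), (2, "orm")])
conn.executemany("INSERT INTO posts_tags VALUES (?, ?)", [(1, 1), (1, 2), (2, 1)])


def tags_for_post(post_id):
    """Fetch all tag names for one post with a single join."""
    rows = conn.execute(
        """SELECT t.name
           FROM posts_tags pt
           JOIN tags t ON t.id = pt.tag_id
           WHERE pt.post_id = ?
           ORDER BY t.name""",
        (post_id,),
    ).fetchall()
    return [name for (name,) in rows]
```

It is still worth measuring on your own data before committing to either design.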

Related

How to deal with 1 to many SQL (Table inputs) in Pentaho Kettle

I have a situation where I have the following tables.
Employee - emp_id, emp_name, emp_address
Employee_assets - emp_id(FK), asset_id, asset_name (1-many for employee)
Employee_family_members - emp_id(FK), fm_name, fm_relationship (1-many for employee)
Now, I have to run a scheduled Kettle job which reads the data from these tables in batches of, say, 1000 employees and creates an XML output for those 1000 records based on their relationships in the DB with family members and assets. It will be a nested XML record for every employee.
Please note that the performance of this Kettle job is crucial in my scenario.
I have two questions here -
What is the best way to pull in records from the database for a 1-many relationship in schema?
What is the best way to generate the XML output structure given that XML join steps are a performance hit?
To pull the data in, you can use multiple db lookup fields or a Database Join step. Performance-wise, I would expect the join to be faster, but that depends on the complexity of the query you use and how it's written.
Here is how I achieved this.
There is one Table Input step that reads the base table and creates the XML chunk for it. Further along the flow, the 1-many relationship (child table) is read by a Database Join step, with the relationship key passed to it. Once the data is pulled out, the XML is generated for the child rows. This is then passed to a Modified Java Script Value step (merge rows), which merges the content, using trans_Status = SKIP_TRANSFORMATION for similar rows. Once similar rows are merged/concatenated, putRow(row) is used to emit the result to the next step.
Please note that this requires the SQL to be sorted (ORDER BY) on the relationship keys. This is performing all right, so I can proceed with it.
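The merge step described above depends on the rows arriving sorted by the relationship key. A minimal sketch of that idea outside Kettle, in plain Python, is shown here; the sample rows and element names are made up for illustration, and this is not the Kettle transformation itself. itertools.groupby only groups consecutive rows, which is exactly why the ORDER BY matters.

```python
from itertools import groupby
from operator import itemgetter
import xml.etree.ElementTree as ET

# Hypothetical flat rows, as a query joined on emp_id and sorted by emp_id
# might return them: (emp_id, emp_name, asset_name).
rows = [
    (1, "Ann", "laptop"),
    (1, "Ann", "phone"),
    (2, "Bob", "badge"),
]


def employees_to_xml(sorted_rows):
    """Collapse consecutive rows with the same emp_id into one nested element."""
    root = ET.Element("employees")
    for emp_id, group in groupby(sorted_rows, key=itemgetter(0)):
        group = list(group)
        emp = ET.SubElement(root, "employee", id=str(emp_id))
        ET.SubElement(emp, "name").text = group[0][1]
        assets = ET.SubElement(emp, "assets")
        for _, _, asset_name in group:
            ET.SubElement(assets, "asset").text = asset_name
    return ET.tostring(root, encoding="unicode")
```

If the input were not sorted, the same employee would show up as several separate elements, one per run of rows.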

typed data set; parent/child select and update with ONE trip to the database (for each op)?

Is it possible, using an ADO.NET typed DataSet containing two tables in a parent/child relationship, to populate the DataSet with ONE trip to the d/b (query could return one or two tables; if one, then result set has columns from both tables, right?), and to update the d/b with ONE trip to the d/b (call to generated stored proc, I guess).
By "is it possible", I mean is it possible to have Visual Studio (2012) automagically generate the classes and SQL code to make this happen?
Or am I kind of on my own? It's looking an awful lot like VS really wants to generate one d/b server round trip for each table involved.
*I guess the update stored proc would have to take table-typed parameters from both parent and child, and perform inserts/updates/deletes appropriately.
Yes, one round trip per table is the way to go.
(- It's certainly possible to use a join query to populate a datatable but VS will then be reluctant to generate update etc SQL. This may or may not be a problem, depending on what you intend to do with the dataset.)
But if you have two tables in a dataset, let's say customers - orders, then you would typically use two queries, and two trips to the db:
SELECT * FROM customers WHERE customers.customerid=@customerid
and
SELECT * FROM orders WHERE orders.customerid=@customerid
Somewhat more counter-intuitive is the situation where you want all customers and orders for one country:
SELECT * FROM customers WHERE customers.countryid=@countryid
and
SELECT orders.* FROM orders INNER JOIN customers ON customers.customerid=orders.customerid WHERE customers.countryid=@countryid
Note how the join query returns data from only one table, but uses the join to identify which rows to return.
Then, once you have the data in your dataset, you can navigate it using the GetParentRow and GetChildRows methods. This is how ADO.NET manages hierarchical data.
You do need this one-table-at-a-time approach because, assuming you have foreign key constraints in your db, inserts and updates have to run in the reverse order from deletes: parent rows are inserted before their children, but child rows are deleted before their parents.
EDIT: Yes, this does mean that in some circumstances, depending on the data you want and the structure of your primary keys, you could end up with a humongous set of JOINs that still only pull the data from the table at the end of the hierarchy. This might seem wrong in terms of traditional SQL, but actually it's fine. The time you lose in the multiple, more complex queries is made up by the reduced amount of data you have to pull back across the wire, compared with one big join query that would return multiple copies of the parent data.
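The two-query pattern for the country case can be sketched without ADO.NET. The following uses Python's sqlite3 module purely to illustrate the shape of the queries; the column names beyond customerid and countryid, and the sample data, are assumptions. Each parent row comes back once, and the child query uses the join only to decide which order rows to return.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customerid INTEGER PRIMARY KEY, name TEXT, countryid INTEGER);
CREATE TABLE orders (
    orderid    INTEGER PRIMARY KEY,
    customerid INTEGER REFERENCES customers(customerid),
    total      REAL
);
INSERT INTO customers VALUES (1, 'Ann', 10), (2, 'Bob', 10), (3, 'Cy', 20);
INSERT INTO orders VALUES (100, 1, 5.0), (101, 1, 7.5), (102, 3, 9.0);
""")


def load_country(countryid):
    """Load parents and children with one query per table, then link them in memory."""
    customers = conn.execute(
        "SELECT customerid, name FROM customers "
        "WHERE countryid = ? ORDER BY customerid",
        (countryid,),
    ).fetchall()
    # The join filters the orders; only columns from orders are selected.
    orders = conn.execute(
        """SELECT orders.orderid, orders.customerid, orders.total
           FROM orders
           INNER JOIN customers ON customers.customerid = orders.customerid
           WHERE customers.countryid = ?
           ORDER BY orders.orderid""",
        (countryid,),
    ).fetchall()
    children = defaultdict(list)
    for orderid, customerid, total in orders:
        children[customerid].append((orderid, total))
    return {cid: {"name": name, "orders": children[cid]} for cid, name in customers}
```

The in-memory linking here plays the role that the DataSet relation plays in ADO.NET.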

Left Join 1 to 1/0 with llblgen?

With EF, if you navigate to a singular related entity within a select projection (such as from the many side of a many-to-one or 1-to-1/0), it coalesces nulls and gives you a left join: https://stackoverflow.com/a/2525950/84206
Since it occurs in a projection and not in a join, EF makes a pretty reasonable assumption that a left join is desired.
However, I haven't found a way to accomplish this in LINQ with LLBLGen. The above technique produces an inner join with LLBLGen. I can't use techniques that use DefaultIfEmpty because that's only available when navigating into a many relationship.
I am hoping to avoid using WithPath/Prefetch because I'd really like to do the projection in LINQ instead of grabbing a huge object graph into memory and doing the projection in memory.
This is LLBLGen 3.5.
If the FK is nullable, the join will be a left join. If the FK isn't nullable, it will be an inner join. This is the only way to determine what you want, as LINQ lacks any other mechanism to specify the join type here. The query in your link must also use a nullable (optional) FK side to get a left join.
If nothing helps, please use QuerySpec; the query API will allow you to specify the join type in any case.
ps: please next time post on our forums, we don't monitor SO every day, but we do monitor our forums.
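To see why the nullability of the FK is a sensible signal for the join type, here is a small sketch in plain SQL run through Python's sqlite3 module. It does not use LLBLGen or EF, and the person/address tables are hypothetical. With a nullable FK, an inner join silently drops the rows whose FK is NULL, while a left join keeps them with NULLs on the related side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE address (id INTEGER PRIMARY KEY, city TEXT NOT NULL);
CREATE TABLE person (
    id         INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    address_id INTEGER REFERENCES address(id)  -- nullable FK: the 1-to-1/0 side
);
INSERT INTO address VALUES (1, 'Omaha');
INSERT INTO person VALUES (1, 'Ann', 1), (2, 'Bob', NULL);
""")

# Inner join: Bob has no address, so his row disappears from the projection.
inner_rows = conn.execute(
    """SELECT p.name, a.city
       FROM person p
       INNER JOIN address a ON a.id = p.address_id
       ORDER BY p.id"""
).fetchall()

# Left join: Bob is kept, with NULL (None in Python) for the city.
left_rows = conn.execute(
    """SELECT p.name, a.city
       FROM person p
       LEFT JOIN address a ON a.id = p.address_id
       ORDER BY p.id"""
).fetchall()
```

If the FK were declared NOT NULL, both queries would return the same rows, which is why an inner join is a safe default in that case.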

Struggling with model relationships

I'm having a hard time designing a relationship with a few models in my project.
The models are: band, musician, instrument
Bands have multiple musicians
Musicians have multiple bands and multiple instruments
That’s all pretty straightforward, but I also need to keep track of what instruments a musician has for a particular band. So in a sense, I guess, bands have multiple instruments via the musicians.
In the tables, I was going to add instrument_id to the bands_musicians linking table, but I need a musician to be able to have multiple instruments for a band, so I was thinking it would need to go in the musicians_instruments table.
What's the best way to set up the relationships with these models?
Thanks for your time!
Musicians would have a many-to-many relationship with both bands and instruments. So create your musicians table and add all of the information relevant to the musicians themselves into that table.
Create an instruments table to hold information about instruments, and do the same for the bands. That will take care of all of your individual items.
Then create something like a 'band_assignments' table that just has the id of a band and the id of a musician and links the two together. Create an 'instrument_assignments' table to do the same thing.
Now when you query a musician you can left join all of these tables together to get the data that you need or selectively join on just instruments, just bands, or sort by 'join date' and limit to get the last band they joined or the last instrument they learned.
Basically 4 tables should cover it all.
musicians (musician_id, first_name, last_name)
bands (band_id, name)
instruments (instrument_id, name)
band_instrument_assignments (musician_id, band_id, instrument_id, date_played)
As you can see in the edited version above, you will have multiple rows in the 'band_instrument_assignments' table--one for each instrument that each musician played in each band. You will need to use some GROUP BY and LIMIT clauses to get the data you want, but it should work for you.
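Here is a minimal sketch of the four-table layout above, using Python's sqlite3 module. The table and column names follow the listing; the composite primary key, the sample data, and the helper function names are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE musicians (musician_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
CREATE TABLE bands (band_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE instruments (instrument_id INTEGER PRIMARY KEY, name TEXT);
-- One row per instrument that a musician plays in a band.
CREATE TABLE band_instrument_assignments (
    musician_id   INTEGER NOT NULL REFERENCES musicians(musician_id),
    band_id       INTEGER NOT NULL REFERENCES bands(band_id),
    instrument_id INTEGER NOT NULL REFERENCES instruments(instrument_id),
    date_played   TEXT,
    PRIMARY KEY (musician_id, band_id, instrument_id)  -- assumed key
);
""")
conn.executemany("INSERT INTO musicians VALUES (?, ?, ?)",
                 [(1, "Ann", "Lee"), (2, "Bob", "Ray")])
conn.executemany("INSERT INTO bands VALUES (?, ?)", [(1, "Alpha"), (2, "Beta")])
conn.executemany("INSERT INTO instruments VALUES (?, ?)",
                 [(1, "guitar"), (2, "drums"), (3, "bass")])
conn.executemany(
    "INSERT INTO band_instrument_assignments VALUES (?, ?, ?, ?)",
    [(1, 1, 1, "2020-01-01"), (1, 1, 3, "2020-02-01"),
     (1, 2, 2, "2021-01-01"), (2, 1, 2, "2020-01-01")],
)


def instruments_for(musician_id, band_id):
    """Instruments one musician plays in one particular band."""
    rows = conn.execute(
        """SELECT i.name
           FROM band_instrument_assignments a
           JOIN instruments i ON i.instrument_id = a.instrument_id
           WHERE a.musician_id = ? AND a.band_id = ?
           ORDER BY i.name""",
        (musician_id, band_id),
    ).fetchall()
    return [name for (name,) in rows]


def musicians_in_band(band_id):
    """Each musician once, even if they play several instruments in the band."""
    rows = conn.execute(
        """SELECT DISTINCT m.musician_id, m.first_name
           FROM band_instrument_assignments a
           JOIN musicians m ON m.musician_id = a.musician_id
           WHERE a.band_id = ?
           ORDER BY m.musician_id""",
        (band_id,),
    ).fetchall()
    return [first_name for (_, first_name) in rows]
```

Because the assignment table carries all three ids, "which instruments does this band use" falls out of the same table without a separate bands-to-instruments link.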
See:
How to handle a Many-to-Many relationship with PHP and MySQL
That should give you an idea on how to go about designing your database structure.
Maybe you need a 4th model that covers and unites all of its child entities, called e.g. 'Mus Model' (or whatever you want), with methods like:
get_bands()
get_instruments()
get_musicians()
get_instruments_by_musician()
get_musicians_by_band()
get_instruments_by_band()
get_band_by_musician()
and so on... It'll provide the data you need and will not break the entities' relationships, imho.
I might be a little late to the party here and I am no database expert but I have found that drawing out your DB schema helps immensely. Just make boxes and fill in your table names and columns then draw arrows to define your relationships and it should be a lot clearer as to how you should structure things and whether you need to add a table to join two other tables.
If all else fails, just copy a schema from databaseanswers.org. I'm sure there is one there that would probably help you.

How are applications like Twitter implemented?

Suppose A follows 100 people,
then it will need 100 join statements,
which is horrible for the database, I think.
Or are there other ways?
Why would you need 100 joins?
You would have a simple table "Followers" with your ID and the other person's ID in it...
Then you retrieve the "Tweets" with a join, something like this:
-- A Followers row means: FollowerID follows MasterID
select top 100
tweet.*
from
tweet
inner join
Followers on Followers.MasterID = tweet.AuthorID
where
Followers.FollowerID = @yourID
Now you just need decent caching, and make sure you use a non-locking query, and you have all the information... (Well, maybe add some user data into the mix)
Edit:
tweet
    ID - tweet ID
    AuthorID - ID of the poster
Followers
    MasterID - (basically your ID)
    FollowerID - (ID of the person following you)
The Followers table has a composite key made up of MasterID and FollowerID.
It should have 2 indexes: one on (MasterID, FollowerID) and one on (FollowerID, MasterID).
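The table layout and the two indexes can be tried out with Python's sqlite3 module. This sketch reads a Followers row as "FollowerID follows MasterID", per the column descriptions above. The Body column, the index on tweet.AuthorID, the ordering by tweet ID, and the sample data are assumptions added for illustration. SQLite uses LIMIT where SQL Server uses TOP.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tweet (
    ID       INTEGER PRIMARY KEY,
    AuthorID INTEGER NOT NULL,
    Body     TEXT              -- assumed column, not in the original schema
);
CREATE TABLE Followers (
    MasterID   INTEGER NOT NULL,  -- the person being followed
    FollowerID INTEGER NOT NULL,  -- the person following them
    PRIMARY KEY (MasterID, FollowerID)          -- first index
);
CREATE INDEX idx_followers_reverse ON Followers (FollowerID, MasterID);  -- second index
CREATE INDEX idx_tweet_author ON tweet (AuthorID);  -- assumed helper index
INSERT INTO Followers VALUES (2, 1), (3, 1), (1, 2);
INSERT INTO tweet VALUES
    (10, 2, 'from 2'), (11, 3, 'from 3'), (12, 1, 'from 1'), (13, 4, 'from 4');
""")


def timeline(user_id, limit=100):
    """Newest tweets from everyone user_id follows: one join, however many follows."""
    return conn.execute(
        """SELECT tweet.ID, tweet.AuthorID, tweet.Body
           FROM tweet
           INNER JOIN Followers ON Followers.MasterID = tweet.AuthorID
           WHERE Followers.FollowerID = ?
           ORDER BY tweet.ID DESC
           LIMIT ?""",
        (user_id, limit),
    ).fetchall()
```

Following 100 people just means 100 rows in Followers for that user; the query itself does not change.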
The real trick is to minimize your database usage (e.g., cache, cache, cache) and to understand usage patterns. In the specific case of Twitter, they use a bunch of different techniques, from queuing and an insane amount of in-memory caching to some really clever data flow optimizations. Give Scaling Twitter: Making Twitter 10000 percent faster and the other associated articles a read. As for your question about how to implement "following": denormalize the data (precalculate and maintain join tables instead of performing joins on the fly) or don't use a database at all. <-- Make sure to read this!
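One way to picture the "precalculate instead of joining on the fly" idea is a fan-out on write: when someone posts, the tweet ID is pushed onto a stored timeline for each follower, so reading a timeline needs no join at all. The sketch below is a toy in-memory version in Python; all names are hypothetical, and it is not a description of how Twitter actually implements this.

```python
from collections import defaultdict, deque

TIMELINE_LIMIT = 100  # assumed cap on stored timeline entries per user

# author_id -> set of follower ids
followers_of = defaultdict(set)
# user_id -> precomputed timeline of tweet ids, newest first
timelines = defaultdict(lambda: deque(maxlen=TIMELINE_LIMIT))


def follow(follower_id, author_id):
    """Record that follower_id follows author_id."""
    followers_of[author_id].add(follower_id)


def post_tweet(author_id, tweet_id):
    """Do the 'join' once, at write time: copy the tweet id to every follower."""
    for follower_id in followers_of[author_id]:
        # appendleft on a full deque drops the oldest entry from the right.
        timelines[follower_id].appendleft(tweet_id)


def read_timeline(user_id):
    """Reading is a plain lookup; no join is performed."""
    return list(timelines[user_id])
```

The trade-off is that writes get more expensive for authors with many followers, in exchange for cheap reads.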

Resources