Why am I getting all the pizzas (relational algebra) and why are my joins messing up?

This is the database I am using for my queries:
https://class.stanford.edu/c4x/DB/RA/asset/pizzadata.html
The syntax for writing relational algebra queries is based on http://www.cs.duke.edu/~junyang/ra/.
My query is to "Find all pizzas eaten by at least one female over the age of 20."
This is what I have so far:
\project_{name,pizza}(
Person \join_{gender='female' and age>20} Eats
)
I think I have the right logic here ("\join_{cond} is the relational theta-join operator"). I also kept the name column for debugging purposes. I am joining the two relations and only keeping the rows where gender is 'female' and age is > 20.
My query's result does not match the correct one, and I don't think this is a syntax issue. In the Eats relation, Fay only eats mushroom; I don't understand why she is paired with every pizza combination.

Theta joins are Cartesian: they pair every row of one table with every row of the other, and keep the pairs that satisfy the condition. In your example you are joining every row of Person where gender='female' and age>20 with every row of Eats, regardless of name. You probably want:
Person \join_{gender='female' and age>20 and name=eater} \rename_{eater, pizza} Eats
Note that theta joins typically increase the number of rows; you typically reduce the number of rows returned using selections. A more idiomatic way of writing your query would be with a selection and a natural join:
\select_{gender='female' and age>20} Person \join Eats
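For comparison, here is a sketch of the same query in SQL, assuming the exercise's schema is Person(name, age, gender) and Eats(name, pizza); the join condition on the shared name column is what keeps Fay paired only with her own pizzas:

-- Join Person to Eats on the shared name column, then filter and project.
SELECT DISTINCT Eats.pizza
FROM Person
JOIN Eats ON Eats.name = Person.name
WHERE Person.gender = 'female'
  AND Person.age > 20;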

Is Gorm's Preload function good practice in terms of performance?

Why is the Preload function in Gorm considered "good practice" when it retrieves all the records of a given table (eager loading) every time it is called?
I have seen a lot of people recommend Preload as a good solution when working with relations, and I don't understand why.
I have no idea what you are talking about, but generally the computational cost is different for joins versus lookups.
Given two tables, cake and topping:
cake (name str)
topping (name str)
If you join, you will have a much larger set (disk I/O, memory) than if you do targeted matches, because you need to calculate ALL COMBINATIONS (a Cartesian product).
table snapshot
cake:
1|napoleon
2|chocolate
3|cheese
topping:
1|butter
2|frosting
3|cacao
4|white cacao
5|goat cheese
6|cow cheese
7|chinese cheese
8|nuts
9|avocado
10|white chocolate
11|cherry-flavor
query logic
With a general join you will have 3 x 11 = 33 results: all cakes times all toppings.
This may seem trivial, but it is not when the tables have 1000+ records.
with "preload" you will have;
get all topings for napoleon => only frosting
get all topings for chocolate => only cacao
get all toppings for cheese => only cow chese and only chinise chese
then; given my napoleon + chocolate + cheese,
you can avoid: butter, cherry flavor
so you select only the relevant related records.
This is not simple, and it causes other problems,
but generally performance is better if you can say:
I need all of x, y, z and never a, b, c.
I hope this makes sense.
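For a concrete sense of the difference, here is roughly the SQL an eager load like Preload emits versus an unconstrained join, sketched against hypothetical cakes and toppings tables with a toppings.cake_id foreign key:

-- Eager loading: one query per table, with the second constrained
-- to the parent keys actually loaded.
SELECT * FROM cakes;
SELECT * FROM toppings WHERE cake_id IN (1, 2, 3);

-- An unconstrained join, by contrast, pairs every cake with every topping:
SELECT * FROM cakes, toppings;  -- 3 x 11 = 33 rows in the snapshot above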

How to set up CosmosDB when you need to search for "like" in string tags

I have a 3-table structure, Customer, Invoice, InvoiceItem, that I would like to try to move from the relational DB and store in CosmosDB. Currently, there are quite intensive queries being run on the InvoiceItem table. This InvoiceItem table has up to 10 optional TagX columns that are basically text that might include the brand, group, type, or something else that would group this InvoiceItem and make it searchable by saying (simplified):
SELECT * FROM InvoiceItem WHERE Tag1 LIKE '%shirt%' AND Tag2 LIKE '%training%'
A query like this on a multi-million row table can take more than 8 minutes. We are working on the archiving strategy and indexes to speed up the process, but it looked to me like CosmosDB could be worth trying in this case, since all of the data is a write-once-read-many scenario.
Back to CosmosDB: how do I deal with those string tags in CosmosDB? As a start, I thought about having Invoice and InvoiceItem in the same partition with a "type" property to distinguish them. But then I cannot stick the tags anywhere that would make them easily searchable. Any ideas on how to set it up?
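To make the "type" discriminator idea concrete, here is a hypothetical sketch (the document shape and property names are made up). Suppose an InvoiceItem document folds the tags into an array, like { "id": "item-1", "type": "InvoiceItem", "invoiceId": "inv-42", "tags": ["shirt", "training"] }; then a Cosmos DB SQL API query could search them with CONTAINS inside an EXISTS subquery:

SELECT * FROM c
WHERE c.type = 'InvoiceItem'
  AND EXISTS (SELECT VALUE t FROM t IN c.tags WHERE CONTAINS(t, 'shirt'))
  AND EXISTS (SELECT VALUE t FROM t IN c.tags WHERE CONTAINS(t, 'train'))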
Thanks!
This is a textbook database performance issue caused by either a lack of indexing or inefficient indexing.
With that many rows, index cardinality becomes important. You don't want to index the entire field; you only want to index the first n characters of the columns you're indexing, and only index columns you are searching, whether via joins or direct where clauses.
The idea is to keep the indexes as small as possible while still giving you the query performance you need.
With 18 million rows you probably want to start with an index cardinality around the square root of 18M.
That means that to reach the index segment you need, you search no more than about 5,000 index entries, each of which covers a segment of roughly 4,000-5,000 rows, at least for sub-second result times.
Indexing the first 3-4 letters would be a good starting point, given that the square root of 18,000,000 is about 4,243 and the nearest power of 26 (26^3 = 17,576, assuming alphabetic characters only) overshoots that. Even if the tags are alphanumeric, 3 characters is still a good starting point.
If the queries then run super fast but the index takes forever to build, drop a character. This is called "index tuning": you pick a starting point and find the shortest prefix (the lowest number of characters indexed) that still gives you the performance you need.
If I'm way off because index performance in this DB is far from the relational-DB mark, you'll need to experiment.
As far as I'm concerned, a select query that takes more than a few seconds is unacceptable, except in rare cases. I once worked for a security company whose license management system took minutes to pull up large customers.
After indexing the tables correctly, the largest customer took less than 2 seconds. I had to sift through a table with billions of rows of download counts, and some of those queries had 7 joins.
If that database can't do this with 18M rows, I'd seriously consider a migration to a better architecture, hardware, software, or otherwise.
Note that as index cardinality approaches table cardinality, the performance gains compared to no index shrink and eventually go negative.
As in all things in life, moderation. At the other end of the spectrum, an index with a cardinality of 2 is just about useless: half of 8 minutes is 4 minutes, assuming a nearly equal distribution. So indexing a boolean field usually isn't a great thing to do. There are few hard and fast rules, though; lots of edge cases. Experimentation is your friend.
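To sketch the prefix-index idea in a relational engine (MySQL syntax here, since it expresses column prefixes directly; the index names are made up):

CREATE INDEX idx_tag1_prefix ON InvoiceItem (Tag1(4));  -- index only the first 4 characters
CREATE INDEX idx_tag2_prefix ON InvoiceItem (Tag2(4));

Note, though, that a prefix B-tree index only helps patterns anchored at the start (LIKE 'shirt%'); a leading wildcard like '%shirt%' still forces a scan.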

What is a skewed column in Oracle?

I found a bottleneck in my query, which selects data from only a single table yet still takes time. I used a non-unique index on the two columns used in the where clause:
select name, isComplete from Student where year='2015' and isComplete='F'
While researching this I came across the concept of a "skewed column". What is it? How does a skewed column affect query performance, and how do I resolve the problem?
Skewed columns are columns in which the data is not evenly distributed among the rows.
For example, suppose:
You have a table order_lines with 100,000,000 rows
The table has a column named customer_id
You have 1,000,000 distinct customers
Some (very large) customers can have hundreds of thousands or millions of order lines.
In the above example, the data in order_lines.customer_id is skewed. On average, you'd expect each distinct customer_id to have 100 order lines (100 million rows divided by 1 million distinct customers). But some large customers have many, many more than 100 order lines.
This hurts performance because Oracle bases its execution plan on statistics. So, statistically speaking, Oracle thinks it can access order_lines based on a non-unique index on customer_id and get only 100 records back, which it might then join to another table or whatever using a NESTED LOOP operation.
But, then when it actually gets 1,000,000 order lines for a particular customer, the index access and nested loop join are hideously slow. It would have been far better for Oracle to do a full table scan and hash join to the other table.
So, when there is skewed data, the optimal access plan depends on which particular customer you are selecting!
Oracle lets you avoid this problem by optionally gathering "histograms" on columns, so Oracle knows which values have lots of rows and which have only a few. That gives the Oracle optimizer the information it needs to generate the best plan in most cases.
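For illustration, a sketch of gathering a histogram on the skewed column with DBMS_STATS, reusing the hypothetical order_lines example above (SIZE sets the number of histogram buckets):

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => USER,
    tabname    => 'ORDER_LINES',
    method_opt => 'FOR COLUMNS customer_id SIZE 254'  -- build a histogram on customer_id
  );
END;
/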
Whether Oracle chooses a full table scan or an index scan can depend on a skewed column.
A skewed column is nothing but one with an uneven spread of values, e.g. a gender column containing 60 males and 40 females.
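A quick way to see how skewed a column actually is, sketched against the order_lines example above, is to count rows per value:

SELECT customer_id, COUNT(*) AS line_count
FROM order_lines
GROUP BY customer_id
ORDER BY line_count DESC;
-- heavily skewed data shows a few values with very large counts at the top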

Performance of SUM in Oracle

I have to sum a huge amount of data with aggregation and a where clause, using the query below.
What I am doing is this: I have three tables; one contains terms, the second contains user terms, and the third contains the correlation factor between a term and a user term.
I want to calculate the similarity between the sentence that the user inserted and the already existing sentences, and keep the results greater than 0.5, by summing the correlation factors between the sentences' terms.
The problem is that this query takes more than 15 minutes, because I have huge tables.
Any suggestions to improve performance, please?
INSERT INTO plag_sentence_similarity
SELECT plag_terms.sentence_id,
       plag_user_terms.sentence_id,
       LEAST(SUM(plag_term_correlations3.correlation_factor) / plag_terms.sentence_length,
             SUM(plag_term_correlations3.correlation_factor) / plag_user_terms.sentence_length),
       plag_terms.isn,
       plag_user_terms.isn
FROM plag_term_correlations3,
     plag_terms,
     plag_user_terms
WHERE plag_terms.term_root = plag_term_correlations3.term1
  AND plag_user_terms.term_root = plag_term_correlations3.term2
  AND plag_user_terms.isn = 123
GROUP BY plag_user_terms.sentence_id, plag_terms.sentence_id, plag_terms.isn,
         plag_terms.sentence_length, plag_user_terms.sentence_length, plag_user_terms.isn
HAVING LEAST(SUM(plag_term_correlations3.correlation_factor) / plag_terms.sentence_length,
             SUM(plag_term_correlations3.correlation_factor) / plag_user_terms.sentence_length) > 0.5;
plag_terms contains more than 50 million records and plag_term_correlations3 contains 500,000.
If you have a sufficient amount of free disk space, then create a materialized view over the join of the three tables:
fast-refreshable on commit (don't use the ANSI join syntax here, even if tempted to do so, or the mview won't be fast-refreshable ... a strange bug in Oracle)
with query rewrite enabled
properly physically organized for quick calculations
The query rewrite is optional. If you can modify the above insert-select, then you can just select from the materialized view instead of selecting from the join of the three tables.
As for the physical organization, consider
hash partitioning by Plag_User_Terms.ISN (with a sufficiently high number of partitions; don't hesitate to partition your table with e.g. 1024 partitions, if it seems reasonable) if you want to do a bulk calculation over all values of ISN
single-table hash clustering by Plag_User_Terms.ISN if you want to retain your calculation over a single ISN
If you don't have spare disk space, then just hint your query to
either use nested loop joins, since the number of rows processed seems to be quite low (judging by the estimates in the execution plan),
or full-scan the plag_term_correlations3 table in parallel.
Bottom line: constrain your tables with foreign keys, check constraints, not-null constraints, unique constraints, everything! The Oracle optimizer is capable of using most of this information to its advantage, as are the people who tune SQL queries.
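For illustration, a minimal sketch of that setup, reusing the table and column names from the question (the materialized view and its name, plag_sim_mv, are made up; a fast-refreshable join mview needs materialized view logs WITH ROWID on each base table and the base-table ROWIDs in its select list):

CREATE MATERIALIZED VIEW LOG ON plag_terms WITH ROWID;
CREATE MATERIALIZED VIEW LOG ON plag_user_terms WITH ROWID;
CREATE MATERIALIZED VIEW LOG ON plag_term_correlations3 WITH ROWID;

CREATE MATERIALIZED VIEW plag_sim_mv
  REFRESH FAST ON COMMIT
  ENABLE QUERY REWRITE
AS
SELECT t.ROWID AS t_rid, u.ROWID AS u_rid, c.ROWID AS c_rid,
       t.sentence_id AS t_sentence_id, u.sentence_id AS u_sentence_id,
       c.correlation_factor,
       t.sentence_length AS t_len, u.sentence_length AS u_len,
       t.isn AS t_isn, u.isn AS u_isn
FROM plag_term_correlations3 c, plag_terms t, plag_user_terms u
WHERE t.term_root = c.term1
  AND u.term_root = c.term2;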

Database and EF performance concern?

I have a basic SQL select question that people have given me different answers to over the years. Say I have a couple of tables, each designed with over 40 columns and potentially holding tens of thousands of rows, and I'm using SQL Server 2005.
When joining these tables, if I have things like this in the where clause:
select * from t1, t2
where t1.UserID = 5
and t1.SomeID = t2.SomeOtherID
Some people say you should always have the constant "t1.UserID = 5" up front rather than after "t1.SomeID = t2.SomeOtherID", because it boosts select performance, while others say it doesn't matter.
What is the correct answer?
Also, if I use ADO.NET Entity Framework to implement my DAL, will modeling tables that have over 40 columns and doing CRUD operations on them be a performance issue?
Thank you,
Ray.
In general, with database optimization, you should write SQL which is conceptually correct first, and then tweak performance if profiling shows it to be necessary. When doing an inner join, it is better to use SQL-92, explicit INNER JOINs than Cartesian products. So I would begin by writing your SQL as follows:
SELECT *
FROM t1
INNER JOIN t2
ON t1.SomeID = t2.SomeOtherID
WHERE
t1.UserID = 5
The t1.SomeID = t2.SomeOtherID condition goes in the ON part of the INNER JOIN because it expresses the relationship between the two tables. The UserID condition goes in the WHERE clause because it is a filter that limits the result set. Writing your SQL in this way gives more information to the database optimizer, because it separates your intentions about the join from your intentions about the filtering.
Now, IF you are not getting acceptable performance with this syntax in a real-world database, then do feel free to experiment with moving bits around. But like I said, start with something that is conceptually correct.
With regard to the second part of your question, the most obvious performance implication is that when you select a collection of entities, the Entity Framework needs to bring back all properties for the entities it is materializing. So if you have 40 columns, you will be pulling that data back over the wire if you materialize them as entities. It is, however, possible to write LINQ queries which return anonymous types containing only the columns you need. To do full CRUD, though, you will need to return entities.
People's opinions on this have changed over time because RDBMS query optimisation has evolved, and different RDBMSs take different approaches. I can't speak for every system out there, but it's really unlikely that in 2008 this makes any difference. YMMV if you are interested only in a specific system.
I can tell you that for any recent version of Oracle it makes no difference.
I know this answer is kind of trite, but I would suggest writing benchmarks. Whip up a console app and test it out yourself. Run the query a couple hundred times and see how long it takes for each way.
There is a lot of superstition when it comes to SQL query performance and optimization. Some people do things thinking it is faster but they don't actually check their facts. Also, the way EF or LinqToSql work and interact with the DB may introduce performance differences not evident in SQL.
If you're optimizing code you may also want to use a profiler like RedGate ANTS. It's not free, but it can help a lot in finding bottlenecks in your code. Then you can find places in your code to optimize much more easily. It's not always your database slowing your apps down; sometimes you're executing a fast query but doing it a jillion times when you could be caching the result.
Firstly, construct the query using the explicit JOIN syntax rather than the Cartesian product. It probably won't make any difference performance-wise on any modern optimiser, but it does make the information about how the JOINs work more accessible to programmers.
SELECT Player.Name, Game.Date
FROM Player
INNER JOIN Game ON Game.WinnerPlayerID = Player.PlayerID
WHERE Game.WinnerFrags > Game.TotalFrags/2
ORDER BY Player.Name
This gives us all the players, sorted by name, who have taken more frags in a game than all the other players in the game put together, along with the dates of those games. Putting both conditions in the JOIN probably won't affect performance either, since the optimiser is likely to do the filtering as part of the JOIN anyway. It does start to matter for LEFT JOINs, though. Let's say we're looking at how many games the week's top ten players have ever won by the margin described above. Since it is possible that some of them have never won this spectacularly, we'll need a LEFT JOIN:
SELECT Player.WeekRank, Player.Name, COUNT(Game.WinnerPlayerID) AS WhitewashCount
FROM Player
LEFT JOIN Game ON Game.WinnerPlayerID = Player.PlayerID
WHERE Player.WeekRank >= 10
AND Game.WinnerFrags > Game.TotalFrags/2
GROUP BY Player.WeekRank, Player.Name
ORDER BY Player.WeekRank
Well, not quite. The JOIN will return records for each game played by a player, or the player data with NULL game data if the player has played no games. Those results get filtered, during or after the JOIN depending on the optimiser's decision, on the frag criteria. That eliminates all the records that don't meet the frag criteria, so there will be no rows to group for players who have never had such a spectacular win. Effectively, it creates an INNER JOIN .... FAIL.
SELECT Player.WeekRank, Player.Name, COUNT(Game.WinnerPlayerID) AS WhitewashCount
FROM Player
LEFT JOIN Game ON Game.WinnerPlayerID = Player.PlayerID
AND Game.WinnerFrags > Game.TotalFrags/2
WHERE Player.WeekRank >= 10
GROUP BY Player.WeekRank, Player.Name
ORDER BY Player.WeekRank
Once we move the frag criteria into the JOIN, the query behaves correctly, returning records for all players in the week's top ten, irrespective of whether they've achieved a whitewash.
After all of that, the short answer is:
For INNER JOIN situations it probably doesn't make a difference to performance where you put the conditions. The queries are more readable if you separate the join and filtering conditions, though. And getting a condition in the wrong place can seriously mess up the results of a LEFT JOIN.
