What Cannot be Expressed by Relational Algebra?

What are some queries that cannot be expressed by the basic operations of relational algebra?
What I know so far is that aggregate functions, such as average and count, cannot be expressed. What others are there?

Any query for which [generalized] transitive closure is required. Relational algebra can join a relation with itself any fixed number of times, but no single expression covers paths of arbitrary length. For example:
Is there a flight path from Cape Town to Wellington?
How long will a train ride from London to Sofia take, how many times will I have to change trains, and how long will I be kept waiting in the station?
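To make the flight example concrete, here is a minimal sketch (the flight relation and class are my own illustration) of the fixed-point iteration that the algebra itself cannot express; a breadth-first search answers the reachability question directly:

import java.util.*;

public class Reachability {
    // Breadth-first search over a Flights(origin -> destinations) relation,
    // i.e. a membership test in the transitive closure of the relation.
    static boolean reachable(Map<String, List<String>> flights, String from, String to) {
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(from);
        while (!queue.isEmpty()) {
            String city = queue.poll();
            if (city.equals(to)) return true;
            if (!seen.add(city)) continue;  // already expanded
            queue.addAll(flights.getOrDefault(city, List.of()));
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<String>> flights = Map.of(
            "Cape Town", List.of("Johannesburg"),
            "Johannesburg", List.of("Sydney"),
            "Sydney", List.of("Wellington"));
        System.out.println(reachable(flights, "Cape Town", "Wellington")); // true
    }
}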


Time series / state space model conceptual

I want to predict a value. I have a time series, as well as a bunch of other time series that may be useful for augmenting the prediction.
Someone is arguing with me that finding the correlation between two non-stationary time series is the same thing as finding the correlation after making both stationary by some sort of differencing. Their logic is that a state space model doesn't care.
Isn't the whole idea of regression to exploit correlations to predict values? Doesn't there have to exist a correlation for the model to explain variance in the data rather than increase the variance in the predictions? Also, I am 100% convinced that finding the correlation between two non-stationary time series without doing anything is wrong... you'll end up with correlations with time and not between the variables themselves.
Any input is helpful. Thanks.
It depends on the models you're employing later on. You say that there has to exist a correlation or else the variance in the predictions will increase. That might hold for some models. Rather, I'd recommend you go for models that have some model selection built into them.
Think of LASSO, for example, which gives sparse coefficient vectors. Or think of a model that allows you to calculate variable importance and base your decisions on that outcome.
Second, let's do some math (writing the unnormalized correlation as an expectation of products):
Correlation (original) = E[X(t) * Y(t)]
Correlation (after differencing) = E[(X(t) - X(t-1)) * (Y(t) - Y(t-1))]
                                 = E[X(t)Y(t)] - E[X(t)Y(t-1)] - E[X(t-1)Y(t)] + E[X(t-1)Y(t-1)]
If you assume that each time series is uncorrelated with the other series' previous sample, the cross-lag terms vanish and this reduces to
= E[X(t)Y(t)] + E[X(t-1)Y(t-1)]
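The practical consequence is easy to demonstrate with a simulation (my own illustration, not from the thread): two completely independent random walks typically show a sizeable correlation in levels, purely because both wander over time, while their first differences correlate near zero:

import java.util.Random;

public class SpuriousCorrelation {
    // Pearson correlation of two equal-length series.
    static double corr(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n; my /= n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - mx) * (y[i] - my);
            sxx += (x[i] - mx) * (x[i] - mx);
            syy += (y[i] - my) * (y[i] - my);
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        int n = 5000;
        Random rng = new Random(42);
        double[] x = new double[n], y = new double[n];
        double[] dx = new double[n - 1], dy = new double[n - 1];
        for (int i = 1; i < n; i++) {
            x[i] = x[i - 1] + rng.nextGaussian();  // two independent random walks
            y[i] = y[i - 1] + rng.nextGaussian();
            dx[i - 1] = x[i] - x[i - 1];           // their first differences
            dy[i - 1] = y[i] - y[i - 1];
        }
        System.out.println("levels:      " + corr(x, y));   // often far from 0
        System.out.println("differences: " + corr(dx, dy)); // near 0
    }
}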

What is rank in the ALS machine learning algorithm in Apache Spark MLlib?

I wanted to try an example of the ALS machine learning algorithm. My code works fine; however, I do not understand the rank parameter used in the algorithm.
I have the following code in Java:
// Build the recommendation model using ALS
int rank = 10;           // number of latent factors
int numIterations = 10;  // number of ALS iterations
double lambda = 0.01;    // regularization parameter
MatrixFactorizationModel model =
    ALS.train(JavaRDD.toRDD(ratings), rank, numIterations, lambda);
I have read somewhere that it is the number of latent factors in the model.
Suppose I have a dataset of (user, product, rating) that has 100 rows. What should the value of rank (latent factors) be?
As you said, the rank refers to the presumed latent or hidden factors. For example, if you were measuring how much different people liked movies and tried to cross-predict them, then you might have three fields: person, movie, number of stars. Now, let's say that you were omniscient and knew the absolute truth, and that in fact all the movie ratings could be perfectly predicted by just 3 hidden factors: sex, age and income. In that case the "rank" of your run should be 3.
Of course, you don't know how many underlying factors, if any, drive your data, so you have to guess. The more you use, the better the results, up to a point, but the more memory and computation time you will need.
One way to work it is to start with a rank of 5-10, then increase it, say 5 at a time, until your results stop improving. That way you determine the best rank for your dataset by experimentation, as sketched below.
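A sketch of that experiment against the same Java API (trainingRatings, validationRatings and computeRmse are assumed helpers of your own, not MLlib functions):

// Try several ranks and keep the one with the best validation RMSE.
int bestRank = -1;
double bestRmse = Double.MAX_VALUE;
for (int candidateRank : new int[]{5, 10, 15, 20, 25}) {
    MatrixFactorizationModel candidate = ALS.train(
        JavaRDD.toRDD(trainingRatings), candidateRank, numIterations, 0.01);
    double rmse = computeRmse(candidate, validationRatings); // lower is better
    if (rmse < bestRmse) {
        bestRmse = rmse;
        bestRank = candidateRank;
    }
}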

Number of segments that cover a point

I'm designing a web service that calculates the number of online users of an arbitrary system.
The input data is an array of tuples (user_id, log_in_time, log_out_time). The service should index this data somehow and prepare data structures in order to efficiently answer requests of the form: "How many users were online at every time point in (start_time, end_time)?". The response of the service is an array: the number of online users for each time point in the requested interval.
Complication: each user has a set of characteristics (e.g. age, gender, city). Is it possible to efficiently answer requests of the form: "How many users with age=x, city=y, gender=z were online at every time point in (start_time, end_time)?"
Time is an integer (a timestamp).
I'm not going to answer this question fully because it is clearly a homework assignment, though you didn't declare it as such.
Assuming the time windows are small, or the number of simultaneously online users within the window is small, simply solve the first problem, then filter by your demographic criteria.
If the number of simultaneously online users is large and filtering after the fact is too time-consuming, then use something similar to a boost::multi_index to filter on the sparsest dimension first, then do your time-range query.
Additionally, most relational databases will do these types of queries out of the box, so the simplest solution would be to store your data in a database with proper indexes and then create the very straightforward query.
Since your comment said you didn't understand how to use a B-tree to do a range query, I'll explain it in my answer. A B-tree is structured so that successive leaves are adjacent to one another. You first do a logarithmic lookup on the minimum bound of your range query; this finds the first point within the time range. Then you do a linear scan from that starting point until you exceed the maximum bound of the range.
This means a B-tree makes your query O(log n + k), where n is the number of indexed events and k is the number of events in the requested interval; see the sketch below.
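A compact way to combine the counting and the range lookup (class and field names are my own illustration): sweep the login/logout events once to build per-event online counts, then answer a point query with a binary search. A range query walks the arrays forward from the lower bound, exactly like the B-tree leaf scan described above.

import java.util.*;

public class OnlineCounter {
    private final long[] times;  // event timestamps, sorted ascending
    private final int[] online;  // users online from times[i] until the next event

    public OnlineCounter(long[][] sessions) {  // each row: {logInTime, logOutTime}
        TreeMap<Long, Integer> delta = new TreeMap<>();
        for (long[] s : sessions) {
            delta.merge(s[0], 1, Integer::sum);   // +1 at login
            delta.merge(s[1], -1, Integer::sum);  // -1 at logout
        }
        times = new long[delta.size()];
        online = new int[delta.size()];
        int i = 0, running = 0;
        for (Map.Entry<Long, Integer> e : delta.entrySet()) {
            running += e.getValue();              // prefix sum = users online
            times[i] = e.getKey();
            online[i++] = running;
        }
    }

    // Number of users online at time t: the count at the last event <= t.
    public int at(long t) {
        int idx = Arrays.binarySearch(times, t);
        if (idx < 0) idx = -idx - 2;              // insertion point minus one
        return idx < 0 ? 0 : online[idx];
    }
}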

How does Oracle calculate the cost in an explain plan?

Can anyone explain how the cost is evaluated in an Oracle explain plan?
Is there any specific algorithm to determine the cost of the query?
For example: full table scans have higher cost, index scan lower... How does Oracle evaluate the cases for full table scan, index range scan, etc.?
This link asks the same thing as I am: Question about Cost in Oracle Explain Plan.
But can anyone explain with an example? We can find the cost by running EXPLAIN PLAN, but how does it work internally?
There are many, many specific algorithms for computing the cost, far more than could realistically be discussed here. Jonathan Lewis has done an admirable job of walking through how the cost-based optimizer decides on the cost of a query in his book Cost-Based Oracle Fundamentals. If you're really interested, that's going to be the best place to start.
It is a fallacy to assume that full table scans will have a higher cost than, say, an index scan. The cost depends on the optimizer's estimate of the number of rows in the table, its estimate of the number of rows the query will return (which, in turn, depends on its estimates of the selectivity of the various predicates), the relative cost of a single-block read vs. a multiblock read, the speed of the processor, the speed of the disk, the probability that blocks will be available in the buffer cache, your database's optimizer settings, your session's optimizer settings, the PARALLEL attribute of your tables and indexes, and a whole bunch of other factors (this is why it takes a book to really start to dive into this sort of thing). In general, Oracle will prefer a full table scan if your query is going to return a large fraction of the rows in the table, and an index access if it is going to return a small fraction. And "small fraction" is generally much smaller than people initially estimate: if you're returning 20-25% of the rows in a table, for example, you're almost always better off with a full table scan.
If you are trying to use the COST column in a query plan to determine whether the plan is "good" or "bad", you're probably going down the wrong path. The COST is only valid if the optimizer's estimates are accurate. But the most common reason that query plans would be incorrect is that the optimizer's estimates are incorrect (statistics are incorrect, Oracle's estimates of selectivity are incorrect, etc.). That means that if you see one plan for a query that has a cost of 6 and a plan for a different version of that query that has a cost of 6 million, it is entirely possible that the plan that has a cost of 6 million is more efficient because the plan with the low cost is incorrectly assuming that some step is going to return 1 row rather than 1 million rows.
You are much better served ignoring the COST column and focusing on the CARDINALITY column. CARDINALITY is the optimizer's estimate of the number of rows that are going to be returned at each step of the plan. CARDINALITY is something you can directly test and compare against. If, for example, you see a step in the plan that involves a full scan of table A with no predicates and you know that A has roughly 100,000 rows, it would be concerning if the optimizer's CARDINALITY estimate was either way too high or way too low. If it was estimating the cardinality to be 100 or 10,000,000, then the optimizer would almost certainly be picking the table scan in error, or feeding that data into a later step where its cost estimate would be grossly incorrect, leading it to pick a poor join order or a poor join method. It would probably also indicate that the statistics on table A were incorrect. On the other hand, if you see that the cardinality estimates at each step are reasonably close to reality, there is a very good chance that Oracle has picked a reasonably good plan for the query.
Another place to get started on understanding the CBO's algorithms is this paper by Wolfgang Breitling. Jonathan Lewis's book is more detailed and more recent, but the paper is a good introduction.
In the 9i documentation Oracle published an authoritative-looking mathematical model for cost:
Cost = (#SRds * sreadtim +
        #MRds * mreadtim +
        #CPUCycles / cpuspeed) / sreadtim

where:
  #SRds      is the number of single block reads
  #MRds      is the number of multi block reads
  #CPUCycles is the number of CPU cycles
  sreadtim   is the single block read time
  mreadtim   is the multi block read time
  cpuspeed   is the number of CPU cycles per second
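Plugging in made-up numbers shows how the formula behaves: say 200 single block reads at 4 ms each, 50 multi block reads at 16 ms each, and negligible CPU time. Then:

Cost = (200 * 4 + 50 * 16 + 0) / 4 = (800 + 800) / 4 = 400

Because the total is divided by sreadtim, the cost comes out in units of equivalent single block reads.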
So it gives a good idea of the factors that go into calculating cost. This is why Oracle introduced the capability to gather system statistics: to provide accurate values for CPU speed, read times, and so on.
Now fast-forward to the equivalent 11g documentation and we find the maths has been replaced with a cursory explanation:
"Cost of the operation as estimated by the optimizer's query approach.
Cost is not determined for table access operations. The value of this
column does not have any particular unit of measurement; it is merely
a weighted value used to compare costs of execution plans. The value
of this column is a function of the CPU_COST and IO_COST columns."
I think this reflects the fact that cost just isn't a very reliable indicator of execution time. Jonathan Lewis recently posted a pertinent blog piece. He shows two similar-looking queries whose explain plans are different but whose costs are identical. Nevertheless, when it comes to runtime, one query performs considerably slower than the other. Read it here.

Which is more efficient - Computing results using a function in realtime or reading the results directly from a database?

Let us take this example scenario:
There exists a really complex function that involves mathematical square roots and cube roots (which are slow to process) to compute its output. As an example, let us assume the function accepts two parameters, a and b, and that the input range for both values is well-defined; say a and b can each range from 0 to 100.
So essentially fn(a,b) can either be computed in real time, or its results can be pre-filled in a database and fetched as and when required.
Method 1: Compute in realtime
function fn(a, b) {
    result = compute_using_cuberoots(a, b)
    return result
}
Method 2: Fetch the function result from a database
We have a database pre-filled with the input values mapped to the corresponding result:
a | b | result
0 | 0 | 12.4
1 | 0 | 14.8
2 | 0 | 18.6
. | . | .
. | . | .
100 | 100 | 1230.1
And we can then:
function fn(a, b) {
    result = fetch_from_db(a, b)
    return result
}
My question:
Which method would you advocate and why? Why do you think one method is more efficient than the other?
I believe this is a scenario that most of us will face at some point during our programming life and hence this question.
Thank you.
Question background (might not be relevant)
Example: in scenarios like image processing it is common to come across such situations, where the range of input values (R, G, B) is known (0-255) and the mathematical computation of square roots and cube roots adds too much time to the handling of server requests.
Suppose, for example, you're building an app like Instagram: the time taken to process an image sent to the server by a user, and to return the processed image, must be kept minimal for an optimal user experience. In such situations it is important to minimize the time taken to process each image. Worse yet, scalability problems appear when the number of such processing requests grows large.
Hence it is necessary to choose whichever of the methods described above is the most efficient in such situations; a concrete middle ground for this bounded input range is sketched below.
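For the 8-bit case there are only 256 possible values per channel, so the whole result table fits in an in-memory array and neither the math nor a database round trip is needed per pixel. A sketch of that variant of Method 2 (in Java for illustration; Math.cbrt merely stands in for the really complex function):

public class ChannelLut {
    // Precompute the expensive per-channel function once for all 256 inputs.
    private static final double[] LUT = new double[256];
    static {
        for (int v = 0; v < 256; v++) {
            LUT[v] = Math.cbrt(v);  // stand-in for the really complex function
        }
    }

    // Per-pixel work is then a single array read: no recomputation, no I/O.
    static double fn(int channelValue) {
        return LUT[channelValue];
    }

    public static void main(String[] args) {
        System.out.println(fn(8)); // cbrt(8) = 2.0
    }
}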
More details on my situation (if required):
Framework: Ruby on Rails; database: MongoDB
I wouldn't advocate either method, I'd test them both (if I thought they were both reasonable) and get some data.
Having written that, I'll rise to the bait: given the relative speed of computation vs I/O I would expect computation to be faster than retrieving the function values from a database. I'll acknowledge the possibility (and no more) that in some special cases an in-memory database will be able to outperform (re-)computation, but as a general rule, no.
"More efficient" is a fuzzy term. "Faster" is more concrete.
If you're talking about a few million rows in a SQL database table, then selecting a single row might well be faster than calculating the result. On commodity hardware, using an untuned server, I can usually return a single row from an indexed table of millions of rows in just a few tenths of a millisecond. But I'd think hard before installing a dbms server and building a database only for this one purpose.
To make "faster" a little less concrete, when you're talking about user experience, and within certain limits, actual speed is less important than apparent speed. The right kind of feedback at the right time makes people either feel like things are running fast, or at least makes them feel like waiting just a little bit is not a big deal. For details about exactly how to do that, I'd look at User Experience on the Stack Exchange network.
The good thing is that it's pretty simple to test both ways. For speed testing just this particular issue, you don't even need to store the right values in the database. You just need to have the right keys and indexes. I'd consider doing that if calculating the right values is going to take all day.
You should probably test over an extended period of time. I'd expect there to be more variation in speed from the dbms. I don't know how much variation you should expect, though.
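As a rough sketch of such a test (illustrative only: an in-memory map stands in for the database table, which already biases the comparison toward lookup since a real datastore adds network and disk I/O, and a serious measurement would use a proper harness such as JMH):

import java.util.HashMap;
import java.util.Map;

public class ComputeVsLookup {
    public static void main(String[] args) {
        // Pre-fill a map standing in for the table of fn(a, b) results.
        Map<Integer, Double> table = new HashMap<>();
        for (int a = 0; a <= 100; a++)
            for (int b = 0; b <= 100; b++)
                table.put(a * 101 + b, Math.cbrt(a) + Math.sqrt(b));

        double sink = 0;  // accumulate results so the JIT cannot drop the work

        long t0 = System.nanoTime();
        for (int i = 0; i < 1_000_000; i++) {
            int a = i % 101, b = (i / 101) % 101;
            sink += Math.cbrt(a) + Math.sqrt(b);  // compute every time
        }
        long computeNs = System.nanoTime() - t0;

        t0 = System.nanoTime();
        for (int i = 0; i < 1_000_000; i++) {
            int a = i % 101, b = (i / 101) % 101;
            sink += table.get(a * 101 + b);       // look up instead
        }
        long lookupNs = System.nanoTime() - t0;

        System.out.println("compute: " + computeNs + " ns, lookup: "
                + lookupNs + " ns (checksum " + sink + ")");
    }
}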
Precomputing results and reading them from a table can be a good solution if the inputs are fixed values. Computing in real time and caching the results for a suitable period can be a good solution if the inputs vary between situations.
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." - Donald Knuth
I'd consider using a hash as a combination of calculating and storing. With the really complex function represented here as a**b:
lazy = Hash.new { |h, (a, b)| h[[a, b]] = a**b }  # compute on first access, then cache
lazy[[4, 4]]
p lazy #=> {[4, 4]=>256}
I'd think about storing the values in the code itself:
class MyCalc
  RESULTS = [
    [12.4, 14.8, 18.6, ...],
    ...
    [..., 1230.1]
  ]
  def self.fn a, b
    RESULTS[a][b]
  end
end
MyCalc.fn(0,1) #=> 14.8
