In Oracle's documentation, for the estimator in the optimizer, there is a diagram like this:
https://docs.oracle.com/database/121/TGSQL/img/GUID-22630970-B584-41C9-B104-200CEA2F4707-default.gif
Normally, as far as I know, the plan generator generates the plans and hands them to the estimator one by one so that it can estimate their costs. However, in this diagram, after the query transformer, the query is passed directly to the estimator, even though there is no plan yet.
My question is: what happens when the query is first handed to the estimator by the query transformer? Since there is no plan yet, how does it calculate a cost? Or does it simply pass the query on to the plan generator without any cost the first time?
Thanks in advance.
The estimator is involved in the query optimizer process. Its main task is to measure the plans that the plan generator gives it.
The end goal of the estimator is to estimate the overall cost of a given plan. If statistics are available (notice the statistics box next to the estimator in the image), the estimator uses them to compute the measures (selectivity, cardinality, and cost). The statistics improve the accuracy of those measures.
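As a very rough illustration only, the measures can be thought of along these lines; the numbers and formulas below are simplified stand-ins, not the ones Oracle actually uses:
# Toy illustration: simplified stand-ins for how an estimator can turn
# object statistics into its measures. Oracle's real formulas are far
# more involved; the statistics here are made up.
table_stats = {"num_rows": 1_000_000, "blocks": 20_000}   # made-up table statistics
column_stats = {"num_distinct": 50}                       # made-up column statistics

# Selectivity of an equality predicate such as "col = :b" without a histogram
selectivity = 1 / column_stats["num_distinct"]

# Cardinality: the number of rows an operation is expected to return
cardinality = table_stats["num_rows"] * selectivity

# A crude cost figure for a full table scan: proportional to the blocks read
full_scan_cost = table_stats["blocks"]

print(f"selectivity={selectivity:.4f}, cardinality={cardinality:.0f}, "
      f"full-scan cost ~ {full_scan_cost}")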
Related
I am confused about which significance test to use on my data. I have been following the paper below to obtain representative Lighthouse data; it states that a median should be aggregated over five runs of Lighthouse. Could someone help me understand how I should be statistically analysing the data in my Excel table if I'm only left with two medians? ... a t-test? Mood's median test?
https://pdfs.semanticscholar.org/7c3e/c7ea067b9b3c29415daaa98f6f886bad6676.pdf
I am working on modeling the forest understory using the RandomForest classifier. The results are probability values of understory tree occurrence. I also have an independent dataset that was not used in model building, and I want to test how reliable the prediction model is against this field data.
I would like to know which statistic I should use to do this. I was thinking of using a t-test, but I doubt it is the right statistic. I wonder if I could use kappa or another agreement statistic, but I am not so sure about it. I hope someone can help me with this. Thank you.
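For reference, here is a rough sketch of how an agreement statistic such as Cohen's kappa could be computed against the field data; the values and the 0.5 probability threshold are placeholders, and the probabilities have to be turned into presence/absence classes first:
# Sketch: compare thresholded occurrence probabilities with independent
# field observations using Cohen's kappa. Values and the 0.5 threshold
# are placeholders.
import numpy as np
from sklearn.metrics import cohen_kappa_score

predicted_prob = np.array([0.82, 0.10, 0.55, 0.30, 0.91])   # model output (placeholder values)
field_observed = np.array([1, 0, 1, 0, 1])                  # independent field data (placeholder)

predicted_class = (predicted_prob >= 0.5).astype(int)       # binarise the probabilities

print("Cohen's kappa:", cohen_kappa_score(field_observed, predicted_class))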
Our team is currently evaluating Neo4j, and graph databases as a whole, as a candidate for our backend solution.
The upsides - the flexible data model, fast traversals in a native graph store - are all very applicable to our problem space.
However, we also have a need to perform large-scale aggregations on our datasets. I'm testing a very simple use case with a simple data model: (s:Specimen)-[:DONOR]->(d:Donor)
A Specimen has an edge relating it to a Donor.
The dataset I loaded has ~6 million Specimens, and a few hundred Donors. The aggregation query I want to perform is simple:
MATCH (s: Specimen)-[e: DONOR]->(d: Donor)
WITH d.sex AS sex, COUNT(s.id) AS count
RETURN count, sex
The performance time is very slow - the result does not return for ~9 seconds. We need sub-second return times for this solution to work.
We are running Neo4j on an EC2 instance with 32 vCPUs and 256GB of memory, so compute power shouldn't be a blocker here. The database itself is only 15GB.
We also have indexes on both the Specimen and Donor nodes, as well as an index on the Donor.sex property.
Any suggestions on improving the query times? Or are Graph Databases simply not cut out for such large-scale aggregations?
You will more than likely need to refactor your graph model. For example, you may want to investigate using multiple labels (e.g. something like Specimen:Male / Specimen:Female), if it is appropriate to do so, as this will act as a pre-filter before scanning the db.
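A minimal sketch of what that could look like from the official Python driver, assuming the donor's sex has already been written onto the Specimen nodes as extra labels during load (the connection URI and credentials below are placeholders):
# Sketch: if the donor's sex is denormalised onto each Specimen as an
# extra label (Specimen:Male / Specimen:Female), the per-sex counts no
# longer need to traverse ~6M relationships at query time.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

COUNT_BY_LABEL = """
MATCH (s:Specimen:Male)   RETURN 'male'   AS sex, count(s) AS count
UNION ALL
MATCH (s:Specimen:Female) RETURN 'female' AS sex, count(s) AS count
"""

with driver.session() as session:
    for record in session.run(COUNT_BY_LABEL):
        print(record["sex"], record["count"])

driver.close()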
You may find the following blog posts helpful:
Modelling categorical variables
Modelling relationships
Modelling flights, which talks about dealing with dense nodes
I have a dataset of gold prices and, after some modification and preprocessing, I ended up with the dataframe below:
There are 50,000 records in the dataset and more than 500 different markets with different frequencies. All columns except date are of int type, and date is a datetime object. I need to predict the price per unit on some specific dates, but I am baffled by the number of available methods.
My question is: which regression algorithm/method gives good predictions for this kind of data?
In machine learning or data mining, as they always say, a lot of things can be done in a lot of ways. Let's try to use elimination to decide on the algorithm for the given problem. The primary point is that the class variable (the feature to be predicted) is continuous, hence you should use a regression algorithm. I would suggest starting with linear regression and checking the accuracy using the r^2 score, which is the coefficient of determination (how much of the variance in the target the model explains). If that is not on par, try a random forest regressor.
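A minimal sketch of that suggestion; the CSV file name and the date/price_per_unit column names are placeholders for the prepared dataframe:
# Sketch: fit a linear regression first, check r^2 on held-out data, and
# fall back to a random forest regressor if the score is poor.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("gold_prices.csv")                 # placeholder path for the prepared data
X = df.drop(columns=["date", "price_per_unit"])     # assumed column names
y = df["price_per_unit"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

linear = LinearRegression().fit(X_train, y_train)
linear_r2 = r2_score(y_test, linear.predict(X_test))
print("linear regression r^2:", linear_r2)

if linear_r2 < 0.7:                                 # threshold is arbitrary
    forest = RandomForestRegressor(n_estimators=200, random_state=42)
    forest.fit(X_train, y_train)
    print("random forest r^2:", r2_score(y_test, forest.predict(X_test)))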
I know this has been asked many times before, but none of the answers is definitive.
Here
Here
In my case, my DBA has optimized a select query which takes around 1.05 mins to execute. I made further enhancements so that it runs within 1 sec. But the one I optimized costs more in the execution plan. My DBA is suggesting that, since my query is costlier, we should not change it.
I know that execution plans for two queries are not comparable. But how should I convince my DBA that an execution plan is a "plan", while the results are "actual"?
Could anyone point me in the right direction?
As Tom Kyte says:
You cannot compare the cost of 2 queries with each other. They are simply not comparable.
...
If we get another query -- another similar query -- we go through the same steps, build lots of plans, assign a cost, pick the one with the lowest cost from that set.
We cannot however take those 2 queries and compare the relative cost of each to the other. Those costs are specific to those queries.
Many things influence the "cost" of a query. The "cost" is just some artificial number we arrived at to select a query given a certain environment. ...
Don't compare them, you cannot -- they might as well be random numbers.
A DBA should already know that, as should anyone trying to tune queries (and you said in the question that you do). They should also know to trust Tom's opinion.
The calculated costs are derived from statistics gathered in the past and also depend on configuration. For example, if you set optimizer_index_cost_adj, it will affect the costs but not the actual time it takes to execute the query. And statistics that are a second old are already a second old and no longer 100% accurate.
On the other hand, when you measure the execution time of your query, the measured time is subject to your cache hit ratio. You can of course flush all caches before every query, but that would not resemble a live situation. So be careful when you claim that your statement takes 1.00 seconds instead of 1.05: you might just be doing experiments in a lab that isn't your reality.
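If the argument has to be settled with numbers, measure elapsed time rather than plan cost, and do it over several runs so that cache effects even out. Here is a rough sketch using the python-oracledb driver; the connection details and the two SQL texts are placeholders, not the actual queries from the question:
# Rough timing harness: compare wall-clock time of two candidate queries
# over repeated runs, ignoring the first (cold-cache) execution.
import statistics
import time

import oracledb  # python-oracledb driver

# Placeholder statements; substitute the DBA's version and the tuned version.
DBA_QUERY = "SELECT COUNT(*) FROM all_objects"
TUNED_QUERY = "SELECT COUNT(*) FROM user_objects"

def median_warm_time(cursor, sql, runs=6):
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        cursor.execute(sql)
        cursor.fetchall()                  # fetch everything so the work is actually done
        timings.append(time.perf_counter() - start)
    return statistics.median(timings[1:])  # drop the first, cold-cache run

with oracledb.connect(user="scott", password="tiger", dsn="dbhost/pdb1") as connection:
    with connection.cursor() as cursor:
        print("DBA version  :", median_warm_time(cursor, DBA_QUERY))
        print("Tuned version:", median_warm_time(cursor, TUNED_QUERY))
And as noted above, even warm-run timings only resemble production if the cache behaviour does.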