Unexpected behaviour of rand() in MySQL

I encountered a very weird result while trying to filter my data using the RAND() function.
Suppose I have a table filled with some data:
CREATE TABLE `status_log` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `rank` int(11) DEFAULT 50,
  PRIMARY KEY (`id`)
)
Then I run the following simple select:
select id,rank as rank,(rand()*100) as thres
from status_log
where rank = 50
and get a clear and expected output:
<...skip...>
| 6575476 | 50 | 34.51090244065123 |
| 6575511 | 50 | 67.84258230388404 |
| 6575589 | 50 | 35.68020727083106 |
| 6575644 | 50 | 74.87329251586766 |
| 6575723 | 50 | 67.32584384020961 |
| 6575771 | 50 | 12.009344726809621 |
| 6575863 | 50 | 58.06919518678374 |
+---------+------+-----------------------+
66169 rows in set (2.502 sec).
So, I generate some random data from 0 to 100 and attach a value to each row of the table, around 66000 results in total.
Then I want only a (random) part of the data to be shown. This doesn't serve any production purpose, by the way; it's just an artificial test, so let's not discuss that.
select *
from (
select id,rank as rank,(rand()*100) as thres
from status_log
where rank = 50) t
where thres>rank
order by thres;
After that I get the following:
<...skip...>
| 4396732 | 50 | 99.97966075314177 |
| 4001782 | 50 | 99.98002871869134 |
| 1788580 | 50 | 99.98064143581375 |
| 5300286 | 50 | 99.98275954274717 |
| 146401 | 50 | 99.98552389441573 |
| 4744748 | 50 | 99.98644758014609 |
+---------+------+--------------------+
16449 rows in set (2.188 sec)
It's obvious that for a threshold of 50 the expected number of results should be around 33000 out of the total 66000. So it seems that the distribution of rand() is biased, correct?
Let's then change > to <:
select *
from (
select id,rank as rank,(rand()*100) as thres
from status_log
where rank = 50) t
where thres<rank
order by thres;
<...skip...>
| 4653786 | 50 | 49.98035016467827 |
| 6041489 | 50 | 49.980370281245904 |
| 5064204 | 50 | 49.989308742796354 |
| 1699741 | 50 | 49.991373205549436 |
| 3234039 | 50 | 49.99390454030959 |
| 806791 | 50 | 49.99575274996064 |
| 3713581 | 50 | 49.99814410693771 |
+---------+------+----------------------+
16562 rows in set (2.373 sec)
Again roughly 16000! So not half but a quarter of all results is shown!
It seems that the output of rand() inside the parentheses is somehow influenced by the expression outside them. How is this possible?
I can also union it:
select * from (select id,rank as rank,(rand()*100) as thres from status_log where rank = 50) t where thres<50
UNION ALL
select * from (select id,rank as rank,(rand()*100) as thres from status_log where rank = 50) t where thres>=50;
The expected number of results has to be somewhere around 66000, but it returns only 33000 or so.
I observe this behavior only when rand() is non-deterministic and generated dynamically each time. If I do ...select id,rank as rank,(rand(id)*100)... (i.e. make the output of rand() dependent on id), I start getting the expected number of results (33000-ish). The same happens if I precalculate and fill a temporary field in the table (a sketch of that workaround follows).
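For illustration (just a sketch; the column name thres_precalc is made up):
-- store one random value per row up front, then filter on the stored value
alter table status_log add column thres_precalc double;
update status_log set thres_precalc = rand() * 100;

select id, rank, thres_precalc as thres
from status_log
where rank = 50 and thres_precalc > rank
order by thres;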
I also tried filtering with rank=30, and the results were ~6000 and ~32000 for < and > respectively.
Version 10.5.8-MariaDB-3, InnoDB

Using a single query with HAVING instead of a subquery with WHERE in the main query seems to work around it.
select id,rank as rank,(rand()*100) as thres
from status_log
where rank = 50
having thres > rank
order by thres
This appears to be this bug:
RAND() evaluated and filtered twice with subquery
In other words, the condition is effectively checked against two independently generated RAND() values, so each row survives with probability 0.5 × 0.5 = 0.25, which matches the ~16500 rows you observed (and the ~9% / ~49% split for rank=30).
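If rewriting with HAVING is not an option, another possible workaround (untested here, so treat it as an assumption) is to stop the optimizer from merging the derived table, since the merge is what leads to the expression being evaluated more than once:
-- disable derived table merging for this session, then re-run the original query unchanged
SET optimizer_switch = 'derived_merge=off';
select *
from (
select id,rank as rank,(rand()*100) as thres
from status_log
where rank = 50) t
where thres>rank
order by thres;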

Oracle UNION ALL query takes temp space

I have a query like this:
SELECT * FROM TEST1 LEFT OUTER JOIN TEST2 on TEST1.ID=TEST2.ID
UNION ALL
SELECT * FROM TEST3 LEFT OUTER JOIN TEST4 on TEST3.ID=TEST4.ID;
The behavior I see here is that it first joins the TEST1 and TEST2 tables (billions of rows) and stores the output in the temp tablespace. Then it joins TEST3 and TEST4 and saves that output in the same temp segment. Finally it selects the records from there to display the result.
I see this behavior in both Redshift and Oracle. I was just wondering why it stores the result in temporary segments after getting the result of the first SELECT. It takes time as well as eats up temp space. Can't it just start displaying the result once the first SELECT finishes and then move on to the second one (instead of storing)?
This answer is somewhat speculative, because I don't have an Oracle doc reference. By inspection, we can imagine instead that you wanted to run the following query:
SELECT * FROM TEST1 JOIN TEST2
UNION ALL
SELECT * FROM TEST3 JOIN TEST4
ORDER BY some_col;
It should be clear that to apply an operation over the whole result set, such as ORDER BY, all the records returned by the union query need to be in one logical place. A temp table would seem to work.
The fact that you are not using ORDER BY does not appear to change the workflow Oracle uses.
I can also add another reason why Oracle insists on using a temp table here. Suppose it were possible to write both halves of the union directly to the buffer. What would happen if, at a later date, the size of the total union query suddenly exceeded what the buffer can hold? The answer is that the query would fail. So using a temp table is a safe bet which should generally always work.
How do you observe this behaviour? Do you by any chance perform an INSERT or CREATE TABLE? That would explain your observation, because then all rows are required at the end.
The same can be observed if your client has set an option to fetch all rows.
But in the normal case, where the client is only interested in the first few rows, Oracle quickly returns the first available (array size) rows from the first join and ignores the second one.
You may perform this little Gedankenexperiment:
create table test1 as
select rownum id,
lpad('x',1023,'X') pad
from dual connect by level <= 1000000;
Create tables TEST2 through TEST4 analogously.
Now run your query (adapted to valid syntax):
SELECT * FROM TEST1 CROSS JOIN TEST2
UNION ALL
SELECT * FROM TEST3 CROSS JOIN TEST4;
For me this returns the first page in SQL Developer in about 30 seconds, which somewhat disproves your claim.
Simply calculate the required TEMP space for two 10**6 * 10**6 Cartesian joins with a row length of 1K - it is far above my TEMP configuration.
One possible way to observe what Oracle is actually doing is to run the query with the /*+ gather_plan_statistics */ hint.
Then get the SQL_ID of the statement and check the actual rows (A-Rows) in the plan.
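To find the SQL_ID, you can for example look it up in v$sql (the LIKE pattern below is only an assumption about how your statement text starts):
-- list recent cursors whose text matches the hinted statement
select sql_id, child_number, sql_text
from v$sql
where sql_text like 'SELECT /*+ gather_plan_statistics */%'
order by last_active_time desc;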
select * from table(dbms_xplan.display_cursor('a9y62gxagups6',null,'ALLSTATS LAST'));
SQL_ID a9y62gxagups6, child number 0
-------------------------------------
SELECT /*+ gather_plan_statistics */ * FROM TEST1 CROSS JOIN TEST2
UNION ALL SELECT * FROM TEST3 CROSS JOIN TEST4
Plan hash value: 1763392637
--------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads | Writes | OMem | 1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 50 |00:00:28.52 | 166K| 166K| 142K| | | |
| 1 | UNION-ALL | | 1 | | 50 |00:00:28.52 | 166K| 166K| 142K| | | |
| 2 | MERGE JOIN CARTESIAN| | 1 | 1000G| 50 |00:00:28.52 | 166K| 166K| 142K| | | |
| 3 | TABLE ACCESS FULL | TEST1 | 1 | 1000K| 1 |00:00:00.02 | 4 | 28 | 0 | | | |
| 4 | BUFFER SORT | | 1 | 1000K| 50 |00:00:28.49 | 166K| 166K| 142K| 1255M| 11M| 97M (0)|
| 5 | TABLE ACCESS FULL | TEST2 | 1 | 1000K| 1000K|00:00:03.66 | 166K| 166K| 0 | | | |
| 6 | MERGE JOIN CARTESIAN| | 0 | 1000G| 0 |00:00:00.01 | 0 | 0 | 0 | | | |
| 7 | TABLE ACCESS FULL | TEST3 | 0 | 1000K| 0 |00:00:00.01 | 0 | 0 | 0 | | | |
| 8 | BUFFER SORT | | 0 | 1000K| 0 |00:00:00.01 | 0 | 0 | 0 | 1103M| 10M| |
| 9 | TABLE ACCESS FULL | TEST4 | 0 | 1000K| 0 |00:00:00.01 | 0 | 0 | 0 | | | |
--------------------------------------------------------------------------------------------------------------------------------------
You see that Oracle
1) full scanned table TEST2 (plan line 5)
2) got one row from TEST1 (plan line 3)
3) to return the first 50 rows (plan line 0)
4) left tables TEST3 and TEST4 untouched (plan lines 7 and 9)
You may simply adapt the example to your inner join to see similar results.

MDX Calculate the sum of distinct values

I'm wondering if it is possible to create a calculated member to obtain the sum of distinct values for a fact. I will try to explain it with the following example:
I have a fact whose primary key is related to two dimensions (one-to-many cardinality). The fact contains a measure, and its value is the same for all members of each distinct combination of FACT_ID and DIM_1_ID. For the total, I don't want to count the same values multiple times. So, with the following values the total should be 450 and not 850 (the default Mondrian behavior); a plain-SQL sketch of the intended total follows the table.
| FACT_ID | DIM_1_ID | DIM_2_ID | MEASURE |
|---------|----------|----------|---------|
| 1 | A | D | 100 |
| 1 | A | E | 100 |
| 1 | B | F | 50 |
| 2 | A | D | 300 |
| 2 | A | E | 300 |
|---------|----------|----------|---------|
TOTAL | 450 |
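For reference, a plain-SQL sketch of the total I'm after (the table name FACT_TABLE is just a placeholder for the fact shown above):
-- sum the measure once per distinct (FACT_ID, DIM_1_ID) combination
SELECT SUM(MEASURE) AS TOTAL
FROM (
    SELECT DISTINCT FACT_ID, DIM_1_ID, MEASURE
    FROM FACT_TABLE
) t;
-- with the sample rows this yields 100 + 50 + 300 = 450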
Is it possible? How can it be done with Mondrian?
Thanks in advance
UPDATE - Current status
As described in one of the comments below, based on #whytheq's answer, I managed to calculate the right value for the total, using the following MDX formula for the measure:
Sum(
Order(
[dActivity.hActivity].[lActivity].MEMBERS*[dFacility.hFacility].[lFacility].MEMBERS,
[dActivity.hActivity].[lActivity].currentmember.name
) as [m_set] ,
iif(
[m_set].currentordinal = 0
OR
not(
[m_set]
.item([m_set].currentordinal)
.item(0).NAME
=
[m_set]
.item([m_set].currentordinal-1)
.item(0).NAME
) ,
[Measures].[mBudget]
,
0
)
)
However, this expression uses the complete set for every single row, so the result overrides the measure's real value for the different fact rows.
| FACT_ID | DIM_1_ID | DIM_2_ID | MEASURE |
|---------|----------|----------|---------|
| 1 | A | D | 450 |
| 1 | A | E | 450 |
| 1 | B | F | 450 |
| 2 | A | D | 450 |
| 2 | A | E | 450 |
|---------|----------|----------|---------|
TOTAL | 450 |
Great question - really tricky to do in MDX.
If we do the following then there are 158 rows returned - a handful have duplicate values for [Measures].[Internet Sales Amount]:
SELECT
[Measures].[Internet Sales Amount] ON 0
,NON EMPTY
Order
(
[Product].[Product].[Product]
,[Measures].[Internet Sales Amount]
,bdesc
) ON 1
FROM [Adventure Works];
This only counts them if the member above is different for the respective measure:
WITH
SET [x] AS
Order
(
NonEmpty
(
[Product].[Product].[Product]
,[Measures].[Internet Sales Amount]
)
,[Measures].[Internet Sales Amount]
,bdesc
)
SET [FILTERED] AS
Filter
(
[x]
,
(
[x].Item(
[x].CurrentOrdinal - 1)
,[Measures].[Internet Sales Amount]
)
<>
(
[x].Item(
[x].CurrentOrdinal)
,[Measures].[Internet Sales Amount]
)
)
MEMBER [Measures].[distCount] AS
Count([FILTERED])
SELECT
[Measures].[distCount] ON 0
FROM [Adventure Works];
Maybe try adding the EXISTING keyword into your calculation:
Sum
(
Order
(
EXISTING //<<<
[dActivity.hActivity].[lActivity].MEMBERS
*
[dFacility.hFacility].[lFacility].MEMBERS
,[dActivity.hActivity].[lActivity].CurrentMember.Name
) AS [m_set]
,IIF
(
[m_set].CurrentOrdinal = 0
OR
(NOT
[m_set].Item(
[m_set].CurrentOrdinal).Item(0).Name
=
[m_set].Item(
[m_set].CurrentOrdinal - 1).Item(0).Name)
,[Measures].[mBudget]
,0
)
)
You could try to obtain the average over the set. The code is a bit complex.
WITH SET SomeSet AS
{
Fact.FactID.FactID.MEMBERS
*
Fact.DimID1.DimID1.MEMBERS
*
Fact.DimID2.DimID2.MEMBERS
}
MEMBER Measures.AvgVal AS
AVG
(
{Fact.FactID.CURRENTMEMBER}
*
{Fact.DimID1.CURRENTMEMBER}
*
NonEmpty
(
Fact.DimID2.DimID2.MEMBERS,
{{Fact.FactID.CURRENTMEMBER} *
{Fact.DimID1.CURRENTMEMBER}} *
[Measures].[TheMeasure]
)
,
[Measures].[TheMeasure]
)
SELECT NON EMPTY SomeSet ON 1,
NON EMPTY {
[Measures].[TheMeasure],
Measures.AvgVal
} on 0
from [YourCube]
What I am doing is: for the current FactID-DimID1 combination on the axis, I get the list of all possible DimID2s and then, over the internally generated non-empty tuples of FactID-DimID1-DimID2, derive the average value of the measure TheMeasure.
So, for example, the value (100+100)/2 = 100 would be displayed for the combination of FactID = 1 and DimID1 = A; summing these averages over the distinct FactID-DimID1 combinations (100 + 50 + 300) gives the desired total of 450.

Oracle Recursive Subquery Factoring convert

I'm trying to use the recursive subquery factoring feature but can't get it to do what I want, not even close. I've coded up the logic as an unrolled loop, and I'm asking whether it can be converted into a single recursive SQL query rather than the table-update style I've used.
http://sqlfiddle.com/#!4/b7217/1
There are six players to be ranked. They have id, group id, score and rank.
Initial state
+----+--------+-------+--------+
| id | grp_id | score | rank |
+----+--------+-------+--------+
| 1 | 1 | 100 | (null) |
| 2 | 1 | 90 | (null) |
| 3 | 1 | 70 | (null) |
| 4 | 2 | 95 | (null) |
| 5 | 2 | 70 | (null) |
| 6 | 2 | 60 | (null) |
+----+--------+-------+--------+
I want to take the person with the highest initial score and give them rank 1. Then I apply 10 bonus points to the score of everyone who has the same group id. Then I take the next highest, assign rank 2, distribute bonus points, and so on until there are no players left.
User id breaks ties.
The bonus points change the ranking. id=4 initially appears to be second placed with 95, behind the leader with 100, but with the 10-point bonus, id=2 moves up and takes that spot.
Final state
+-----+---------+--------+------+
| ID | GRP_ID | SCORE | RANK |
+-----+---------+--------+------+
| 1 | 1 | 100 | 1 |
| 2 | 1 | 100 | 2 |
| 4 | 2 | 95 | 3 |
| 3 | 1 | 90 | 4 |
| 5 | 2 | 80 | 5 |
| 6 | 2 | 80 | 6 |
+-----+---------+--------+------+
This is quite a bit late, but I'm not sure this can be done using a recursive CTE. I did however come up with a solution using the MODEL clause:
WITH SAMPLE (ID,GRP_ID,SCORE,RANK) AS (
SELECT 1,1,100,NULL FROM DUAL UNION
SELECT 2,1,90,NULL FROM DUAL UNION
SELECT 3,1,70,NULL FROM DUAL UNION
SELECT 4,2,95,NULL FROM DUAL UNION
SELECT 5,2,70,NULL FROM DUAL UNION
SELECT 6,2,60,NULL FROM DUAL)
SELECT ID,GRP_ID,SCORE,RANK FROM SAMPLE
MODEL
DIMENSION BY (ID,GRP_ID)
MEASURES (SCORE,0 RANK,0 LAST_RANKED_GRP,0 ITEM_COUNT,0 HAS_RANK)
RULES
ITERATE (1000) UNTIL (ITERATION_NUMBER = ITEM_COUNT[1,1]) --ITERATE ONCE FOR EACH ITEM TO BE RANKED
(
RANK[ANY,ANY] = CASE WHEN SCORE[CV(),CV()] = MAX(SCORE) OVER (PARTITION BY HAS_RANK) THEN RANK() OVER (ORDER BY SCORE DESC,ID) ELSE RANK[CV(),CV()] END, --IF THE CURRENT ITEM SCORE IS EQUAL TO THE MAX SCORE OF UNRANKED, ASSIGN A RANK
LAST_RANKED_GRP[ANY,ANY] = FIRST_VALUE(GRP_ID) OVER (ORDER BY RANK DESC),
SCORE[ANY,ANY] = CASE WHEN RANK[CV(),CV()] = 0 AND CV(GRP_ID) = LAST_RANKED_GRP[CV(),CV()] THEN SCORE[CV(),CV()]+10 ELSE SCORE[CV(),CV()] END,
ITEM_COUNT[ANY,ANY] = COUNT(*) OVER (),
HAS_RANK[ANY,ANY] = CASE WHEN RANK[CV(),CV()] <> 0 THEN 1 ELSE 0 END --TO SEPARATE RANKED/UNRANKED ITEMS
)
ORDER BY RANK;
It's not very pretty, and I suspect there is a better way to go about this, but it does give the expected output.
Caveats:
You'd have to increase the iteration count if you have more than that number of rows.
This does a full re-ranking based on the score after each iteration. So if we took your sample data, but changed the initial score of item 2 to 95 rather than 90: after ranking item 1 and giving the 10 point bonus to item 2, it now has a score of 105. So we rank it as 1st and move item 1 down to 2nd. You'd have to make a few modifications if this is not the desired behavior.
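For comparison, the iterative logic described in the question (rank the top unranked player, add the group bonus, repeat) can be sketched as a simple PL/SQL loop; this assumes a PLAYERS table with the columns shown and is only meant to make the procedure explicit, not to replace the MODEL solution:
DECLARE
  v_id   players.id%TYPE;
  v_grp  players.grp_id%TYPE;
  v_rank PLS_INTEGER := 0;
BEGIN
  LOOP
    BEGIN
      -- highest-scoring unranked player, ties broken by id
      SELECT id, grp_id
        INTO v_id, v_grp
        FROM (SELECT id, grp_id
                FROM players
               WHERE rank IS NULL
               ORDER BY score DESC, id)
       WHERE ROWNUM = 1;
    EXCEPTION
      WHEN NO_DATA_FOUND THEN EXIT;  -- nobody left to rank
    END;

    v_rank := v_rank + 1;
    UPDATE players SET rank = v_rank WHERE id = v_id;

    -- 10 bonus points for every still-unranked player in the same group
    UPDATE players
       SET score = score + 10
     WHERE grp_id = v_grp
       AND rank IS NULL;
  END LOOP;
END;
/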

SQL Server - RAND() - range

I want to create random numbers between 1 and 99,999,999.
I am using the following code:
SELECT CAST(RAND() * 100000000 AS INT) AS [RandomNumber]
However my results always have 7 or 8 digits, which means that I never see a value lower than 1,000,000.
Is there any way to generate random numbers between a defined range?
RAND returns a pseudo-random float value from 0 through 1, exclusive.
So RAND() * 100000000 does exactly what you need. However, assuming every number between 1 and 99,999,999 has equal probability, about 99% of the numbers will have 7 or 8 digits, simply because such numbers are far more common:
+--------+-------------------+----------+------------+
| Length | Range | Count | Percent |
+--------+-------------------+----------+------------+
| 1 | 1-9 | 9 | 0.000009 |
| 2 | 10-99 | 90 | 0.000090 |
| 3 | 100-999 | 900 | 0.000900 |
| 4 | 1000-9999 | 9000 | 0.009000 |
| 5 | 10000-99999 | 90000 | 0.090000 |
| 6 | 100000-999999 | 900000 | 0.900000 |
| 7 | 1000000-9999999 | 9000000 | 9.000000 |
| 8 | 10000000-99999999 | 90000000 | 90.000001 |
+--------+-------------------+----------+------------+
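If you want an integer drawn uniformly from a closed range, a common idiom (shown here as a sketch) is to scale and floor the random value:
-- FLOOR(RAND() * span) + @Lower covers every value in [@Lower, @Upper] with equal probability
DECLARE @Lower int = 1, @Upper int = 99999999;
SELECT FLOOR(RAND() * (@Upper - @Lower + 1)) + @Lower AS RandomNumber;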
I created a function that might help. You will need to pass RAND() to it for it to work.
CREATE FUNCTION [dbo].[RangedRand]
(
    @MyRand float
    ,@Lower bigint = 0
    ,@Upper bigint = 999
)
RETURNS bigint
AS
BEGIN
    DECLARE @Random BIGINT
    SELECT @Random = ROUND(((@Upper - @Lower) * @MyRand + @Lower), 0)
    RETURN @Random
END
GO
--Here is how it works.
--Create a test table for Random values
CREATE TABLE #MySample
(
RID INT IDENTITY(1,1) Primary Key
,MyValue bigint
)
GO
-- Lets use the function to populate the value column
INSERT INTO #MySample
(MyValue)
SELECT dbo.RangedRand(RAND(), 0, 100)
GO 1000
-- Lets look at what we get.
SELECT RID, MyValue
FROM #MySample
--ORDER BY MyValue -- Use this "Order By" to see the distribution of the random values
-- Lets use the function again to get a random row from the table
DECLARE @MyMAXID int
SELECT @MyMAXID = MAX(RID)
FROM #MySample
SELECT RID, MyValue
FROM #MySample
WHERE RID = dbo.RangedRand(RAND(), 1, @MyMAXID)
DROP TABLE #MySample
--I hope this helps.

Sum of the grouped distinct values

This is a bit hard to explain in words ... I'm trying to calculate a sum of grouped distinct values in a matrix. Let's say I have the following data returned by a SQL query:
------------------------------------------------
| Group | ParentID | ChildID | ParentProdCount |
| A | 1 | 1 | 2 |
| A | 1 | 2 | 2 |
| A | 1 | 3 | 2 |
| A | 1 | 4 | 2 |
| A | 2 | 5 | 3 |
| A | 2 | 6 | 3 |
| A | 2 | 7 | 3 |
| A | 2 | 8 | 3 |
| B | 3 | 9 | 1 |
| B | 3 | 10 | 1 |
| B | 3 | 11 | 1 |
------------------------------------------------
There's some other data in the query, but it's irrelevant. ParentProdCount is specific to the ParentID.
Now, I have a matrix in the MS Report Designer in which I'm trying to calculate a sum for ParentProdCount (grouped by "Group"). If I just add the expression
=Sum(Fields!ParentProdCount.Value)
I get a result of 20 for Group A and 3 for Group B, which is incorrect. The correct values should be 5 for Group A and 1 for Group B. This wouldn't happen if ChildID weren't involved, but I have to use some other child-specific data in the same matrix.
I tried to nest FIRST() and SUM() aggregate functions but apparently it's not possible to have nested aggregation functions, even when they have scopes defined.
I'm pretty sure there is some way to calculate the grouped distinct sum without needing to create another SQL query. Anyone got an idea how to do that?
OK, I got this sorted out by adding a ROW_NUMBER() function to my SQL query:
SELECT Group, ParentID, ROW_NUMBER() OVER (PARTITION BY ParentID ORDER BY ChildID ASC) AS Position, ChildID, ParentProdCount FROM Table
and then I replaced the SSRS SUM function with
=SUM(IIF(Fields!Position.Value = 1, Fields!ParentProdCount.Value, 0))
Put a grouping on ParentID and use a summation over that group,
e.g.:
if the group over ParentID is "ParentIDGroup"
then
the column sum of ParentProdCount = SUM(Fields!ParentProdCount.Value, "ParentIDGroup")
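If a separate aggregate query were acceptable (the question wanted to avoid one), the de-duplication could also be pushed entirely into SQL; a sketch using the example's column names (the table name MyTable is a placeholder):
SELECT [Group],
       SUM(ParentProdCount) AS GroupProdCount
FROM (
    -- keep one row per parent before summing
    SELECT DISTINCT [Group], ParentID, ParentProdCount
    FROM MyTable
) AS p
GROUP BY [Group]
-- with the sample data this gives 5 for Group A and 1 for Group B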
