How to simulate BigQuery's quantiles in Hive

I want to simulate BigQuery's QUANTILES function in Hive.
Data set: 1,2,3,4
BigQuery returns the value 2 for this query:
select nth(2, quantiles(col1, 3))
But in Hive:
select percentile(col1, 0.5)
I get 2.5.
Note: I get the same result from both for an odd number of records.
Is there an adequate Hive UDF for this?

I guess what you are looking for is the percentile_approx UDF.
This page gives you the list of all built-in UDFs in Hive.
percentile_approx(DOUBLE col, p [, B])
Returns an approximate pth percentile of a numeric column (including floating point types) in the group. The B parameter controls approximation accuracy at the cost of memory. Higher values yield better approximations, and the default is 10,000. When the number of distinct values in col is smaller than B, this gives an exact percentile value.
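For example, a minimal sketch against a hypothetical table mytable holding the values 1, 2, 3, 4 (the table and column names are assumptions, not from the question):
-- Approximate median; when the number of distinct values is below B
-- (default 10,000), percentile_approx returns an exact percentile.
SELECT percentile_approx(CAST(col1 AS DOUBLE), 0.5) AS approx_median
FROM mytable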

Related

Neo4j: Difference between rand() in ORDER BY and rand() in a WITH clause when matching random nodes

I found here that I can select random nodes from Neo4j using the following queries:
MATCH (a:Person) RETURN a ORDER BY rand() limit 10
MATCH (a:Person) with a, rand() as rnd RETURN a ORDER BY rnd limit 10
Both queries seem to do the same thing, but when I try to match random nodes that are in a relationship with a given node, I get different results:
The following query always returns the same nodes (the nodes are not randomly selected):
MATCH (p:Person{user_id: '1'})-[r:REVIEW]->(m:Movie)
return m order by rand() limit 10
...but when I use rand() in a WITH clause I do get random nodes:
MATCH (p:Person{user_id: '1'})-[r:REVIEW]->(m:Movie)
with m, rand() as rnd
return m order by rnd limit 10
Any idea why rand() behaves differently in a WITH clause in the second query but not in the first?
It's important to understand that using rand() in the ORDER BY like this isn't doing what you think it's doing. It's not picking a random number per row, it's ordering by a single number.
It's similar to a query like:
MATCH (p:Person)
RETURN p
ORDER BY 5
Feel free to switch up the number. In any case it doesn't change the ordering, because every row is ordered by the same constant value.
But when you project out a random number in a WITH clause per row, then you're no longer ordering by a single number for all rows, but by a variable which is different per row.

How to calculate the average of rows/selected rows in the matrix

I have a matrix and I need to calculate the average of rows in the matrix for the past 12 months.
The average for 'Actual Exp' will be different from the 'Actual Min' values, and RAG will be calculated based on the average value of 'Actual Exp'.
This is how it should look with the calculated averages.
I don't know how to get the average for 'Actual Exp' and 'Actual Min' in a matrix.
Thanks, guys
To calculate the separate averages for Actual Exp and Actual Min you can create a table with measures.
The formula for the measure is:
Avg_Actual Exp = AVERAGEX({Table Name}, {Column Name})
This formula considers only the valid numeric values and omits text or null values.
For rows that contain only null or text values, the formula is:
Avg_Actual Exp = IF(ISNUMBER(AVERAGEX({Table Name}, {Column Name})), AVERAGEX({Table Name}, {Column Name}), "N/A")
This will return "N/A" if the whole row has only null values.
You can replicate the same thing for the Actual Min.

Efficient sampling of a fixed number of rows in BigQuery

I have a large dataset of size N, and want to get a (uniformly) random sample of size n. This question offers two possible solutions:
SELECT foo FROM mytable WHERE RAND() < n/N
→ This is fast, but doesn't give me exactly n rows (only approximately).
SELECT foo, RAND() as r FROM mytable ORDER BY r LIMIT n
→ This requires sorting N rows, which seems unnecessary and wasteful (especially if n << N).
Is there a solution that combines the advantages of both? I imagine I could use the first solution to select 2n rows, then sort this smaller dataset, but it's sort of ugly and not guaranteed to work, so I'm wondering whether there's a better option.
I compared the execution times of the two queries using BigQuery Standard SQL with the natality sample dataset (137,826,763 rows), getting a sample of the source_year column of size n. The queries are executed without using cached results.
Query1:
SELECT source_year
FROM `bigquery-public-data.samples.natality`
WHERE RAND() < n/137826763
Query2:
SELECT source_year, rand() AS r
FROM `bigquery-public-data.samples.natality`
ORDER BY r
LIMIT n
Result:
n        Query1  Query2
1000     ~2.5s   ~2.5s
10000    ~3s     ~3s
100000   ~3s     ~4s
1000000  ~4.5s   ~15s
For n <= 10^5 the difference is ~1s, and for n >= 10^6 the execution times differ significantly. The cause seems to be that when LIMIT is added to the query, the ORDER BY runs on multiple workers. See the original answer provided by Mikhail Berlyant.
I thought your proposal to combine both queries could be a possible solution.
New Query:
SELECT source_year, rand() AS r
FROM (
  SELECT source_year
  FROM `bigquery-public-data.samples.natality`
  WHERE RAND() < 2*n/137826763
)
ORDER BY r
LIMIT n
Result:
n        Query1  New Query
1000     ~2.5s   ~3s
10000    ~3s     ~3s
100000   ~3s     ~3s
1000000  ~4.5s   ~6s
The execution times in this case differ by <= 1.5s for n <= 10^6. It is a good idea to select n + some_rows rows in the subquery instead of 2n rows, where some_rows is a constant large enough to ensure that more than n rows are returned.
Regarding what you said about “not guaranteed to work”, I understand that you are worried that the new query doesn’t retrieve exactly n rows. In this case, if some_rows is large enough, it will always get more than n rows in the subquery. Therefore, the query will return exactly n rows.
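A parameterized sketch of that variant; n and some_rows are placeholders to substitute with concrete numbers before running (they are not BigQuery variables):
SELECT source_year, RAND() AS r
FROM (
  SELECT source_year
  FROM `bigquery-public-data.samples.natality`
  -- oversample by a safety margin so the subquery almost surely yields more than n rows
  WHERE RAND() < (n + some_rows) / 137826763
)
ORDER BY r
LIMIT n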
To summarize, the combined query is not as fast as Query1, but it returns exactly n rows and it is faster than Query2. So it could be a solution for uniformly random samples. I want to point out that if ORDER BY is not specified, the BigQuery output is non-deterministic, which means you might receive a different result each time you execute the query. If you execute the following query several times without using cached results, you will get different results.
SELECT *
FROM `bigquery-samples.wikipedia_benchmark.Wiki1B`
LIMIT 5
Therefore, depending on how random you need the samples to be, this may be a better solution.

How does ORACLE DB sum NUMBER(*,s) with many records?

I am wondering how Oracle sums NUMBER(9,2) with SUM(numWithScale/7).
This is because I am wondering how the error will propagate with a large number of records.
Let's say I have a table EMP_SAL with columns EMP_ID and numWithScale, numWithScale being a salary.
To make it simple, let us make the numWithScale column NUMBER(9,2): precision 9, with 2 decimal places to round to. All of the numbers in the table are random values from 10.00-20.00 (e.g. 10.12, 20.00, 19.95).
I divide by 7 in my calculation to give random digits at the end that round up or down.
Now, I sum all of the employees salaries with SUM(numWithScale/7).
Will the sum round each time it adds a record, or does Oracle round only after the calculation is complete? If it rounds per record, the error can be +/-0.01 from each rounding, and with many additions followed by roundings the error adds up. If it only rounds at the end, I don't have to worry about the error accumulating (unless I use the result in many more calculations).
Also, will Oracle return the sum as the more precise NUMBER (38-digit precision), or will it round to two decimal places, NUMBER(9,2), when returning the value?
Will MSSQL behave pretty much the same way (even though the syntax is different)?
Oracle performs operations in the order you specify.
So, if you write this query:
select SUM(numWithScale/7) from some_table -- (1)
each value is divided by 7 and rounded to the maximum available precision: NUMBER with 38 significant digits. After that, all of the quotients are summed.
In case of this query:
select SUM(numWithScale)/7 from some_table -- (2)
all numWithScale values are summed and only after that is the total divided by 7. In this case there is no precision loss for each record; only the result of dividing the sum by 7 is rounded to 38 significant digits.
This problem is common in calculation algorithms. Each time you divide a value by 7 you produce a small calculation error because of the limited number of digits representing the number:
numWithScale/7 => quotient + delta.
When summing these values you get
sum(quotient) + sum(delta).
If numWithScale follows an ideal uniform distribution and some_table contains an infinite number of records, then sum(delta) tends to zero. But that happens only in theory. In practical cases sum(delta) grows and introduces a significant error. This is the case for query (1).
On the other hand, summing can't introduce a rounding error if implemented properly. So for query (2) a rounding error is introduced only in the last step, when the whole sum is divided by 7. Therefore the value of delta for this query is not affected by the number of records.
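A small sketch to check this empirically, using the hypothetical EMP_SAL table and numWithScale column from the question:
-- Per-row division then summation vs. summation then a single division.
-- Both run at NUMBER's 38 significant digits, so any difference between the two
-- columns comes from the per-row rounding of numWithScale/7.
SELECT SUM(numWithScale / 7) AS sum_of_quotients,
       SUM(numWithScale) / 7 AS quotient_of_sum
FROM EMP_SAL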
Number scale and precision are only relevant as a column or variable constraint.
When you attempt to store a number that exceeds the defined precision, it will raise an exception:
create table num (a number(5,2));
insert into num values (123456.789);
=> ORA-01438: value larger than specified precision allowed for this column
When you attempt to store a number that exceeds the defined scale, it will be rounded:
insert into num values (123.456789);
select a from num;
=> 123.46
Precision and scale do not matter when you read data and perform any calculations on it...
select 100000 + a / 100 from num;
=> 100001.2346
...unless you want to store it back into column with constraints, so above rules apply:
update num set a = a / 100;
select a from num;
=> 1.23
numWithScale/7 will be converted to NUMBER (i.e. it will not be rounded to number(9,2)).
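A quick way to see this, reusing the num table from the example above:
-- The expression result is a full-precision NUMBER; the NUMBER(5,2) constraint
-- on column a does not carry over to a / 7.
select a / 7 from num;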

Pearson Correlation from multiple rows

I want to calculate the Pearson Correlation between two arrays.
The function CORR only accepts 2 expressions, which have to be in a table. In my procedure I select multiple rows of numbers from two different sets and I want to calculate the correlation between them.
EDIT:
The CORR function is an Oracle function which calculates the Pearson correlation between two expressions. Here is the problem: I want to calculate the correlation between two arrays, which tells me, for example, that array1 is 50% similar to array2.
You can simply calculate the average of the pairwise (column-by-column) correlations:
select
  (abs(corr1) + abs(corr2) + abs(corr3)) / 3 as Avg_Corr
from (
  SELECT
    CORR(a.col1, b.col1) as corr1,
    CORR(a.col2, b.col2) as corr2,
    CORR(a.col3, b.col3) as corr3
  FROM table1 a, table2 b
  WHERE a.id = b.id
)
or use a more complex but more adequate generalization of the Pearson correlation (there is no built-in function in Oracle for this).
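One simple alternative, sketched under the assumption that the arrays are stored column-wise in the same hypothetical table1/table2 layout as above, is to unpivot the columns into (position, value) pairs and compute a single CORR across all of them:
-- Correlate the arrays as wholes instead of averaging per-column correlations.
SELECT CORR(a_val, b_val) AS overall_corr
FROM (
  SELECT a.id, 1 AS pos, a.col1 AS a_val, b.col1 AS b_val
  FROM table1 a JOIN table2 b ON a.id = b.id
  UNION ALL
  SELECT a.id, 2, a.col2, b.col2
  FROM table1 a JOIN table2 b ON a.id = b.id
  UNION ALL
  SELECT a.id, 3, a.col3, b.col3
  FROM table1 a JOIN table2 b ON a.id = b.id
)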
