Hive percentile_approx function is broken, isn't it? - hadoop

I am using Hive 1.2.1000.2.4.2.0-258.
The table has 4,850,000+ rows and three columns: group_id, A, and B; 14,511 rows have A between 73 and 74.
group_id is a constant equal to 0 for every row.
Almost all of the A and B values are integers.
I used the following script to compute summary statistics from the table:
select group_id, --group_id=0 a constant
percentile_approx(A , 0.5) as A_mdn,
percentile_approx(A , 0.25) as A_Q1,
percentile_approx(A , 0.75) as A_Q3,
percentile_approx(A , array(0.2,0.15, 0.1,0.05,0.025,0.001)) as A_i,
min(A) as min_A,
percentile_approx(B , 0.5) as B_mdn,
percentile_approx(B , 0.25) as B_Q1,
percentile_approx(B , 0.75) as B_Q3,
percentile_approx(B , array(0.8,0.85, 0.9, 0.95,0.975)) as B_i
from table
group by group_id;
The result I got is:
0 (group_id)
73.21058033222496 (A_mdn)
73.21058033222496 (A_Q1)
462.16968382794516 (A_Q3)
[73.21058033222496,73.21058033222496,73.21058033222496,73.21058033222496,73.21058033222496,73.21058033222496] (A_i)
0.0 (min_A)
1.0 (B_mdn)
1.0 (B_Q1)
2.0 (B_Q3)
[2.0,3.0,4.0,8.11278644563614,17.0] (B_i)
Then I changed the code as follows, using percentile instead of percentile_approx:
select group_id, --group_id=0 a constant
percentile(cast(A as bigint), 0.5) as A_mdn,
percentile(cast(A as bigint), 0.25) as A_Q1,
percentile(cast(A as bigint), 0.75) as A_Q3,
percentile(cast(A as bigint), array(0.2,0.15, 0.1,0.05,0.025,0.001)) as A_i,
min(A) as min_A,
percentile(cast(B as bigint), 0.5) as B_mdn,
percentile(cast(B as bigint), 0.25) as B_Q1,
percentile(cast(B as bigint), 0.75) as B_Q3,
percentile(cast(B as bigint), array(0.8,0.85, 0.9, 0.95,0.975)) as B_i
from table
group by group_id;
The new result is:
0 (group_id)
72.0 (A_mdn)
6.0 (A_Q1)
762.0 (A_Q3)
[3.0,1.0,1.0,0.0,0.0,0.0] (A_i)
0.0 (min_A)
1.0 (B_mdn)
1.0 (B_Q1)
2.0 (B_Q3)
[2.0,3.0,4.0,9.0,17.0] (B_i)
To double-check, I also loaded this table into R. Here are the R results:
A:
Min: 0
Q1: 6
Median: 72
Q3: 762
0.2 quantile: 3
0.15 quantile: 1.5
0.1 quantile: 1
0.05 quantile: 0
0.025 quantile: 0
0.001 quantile: 0
B:
Q1: 1
Median: 1
Q3: 2
0.8 quantile: 2
0.85 quantile: 3
0.9 quantile: 4
0.95 quantile: 9
0.975 quantile: 17
Clearly, the R results are consistent with the percentile function, but percentile_approx gives me the wrong answer.

Yeah, percentile_approx doesn't have any approximation guarantees, except when you set the accuracy parameter to be greater than or equal to the number of data points.
The source for it is here: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java
From a quick reading, the gist is that it creates a fixed number of buckets (the accuracy parameter), and when it runs out of buckets it merges the two closest buckets and combines them with a weighted sum.
Various inputs will break this, though. In particular, data points that are very high or very low and spaced far apart from each other throw the algorithm off. If you first clip your data to a range without many outliers, it should perform better.
If your data is too skewed, you might instead consider randomly sampling it and computing the exact (non-approx) percentile on the sample.
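For concreteness, here are two hedged sketches that reuse the question's placeholder table and column names. The first passes percentile_approx's optional third argument (the accuracy, which defaults to 10000 in Hive); the second samples roughly 10% of the rows (an arbitrary fraction chosen for illustration) and runs the exact percentile on the sample:
-- Sketch 1: raise the optional accuracy argument; larger values keep more
-- histogram buckets at the cost of memory, and values >= the number of
-- distinct data points make the result exact.
select group_id,
       percentile_approx(A, 0.5, 1000000) as A_mdn
from table
group by group_id;

-- Sketch 2: compute the exact percentile on a random ~10% sample.
select group_id,
       percentile(cast(A as bigint), 0.5) as A_mdn
from (select * from table where rand() <= 0.1) s
group by group_id;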

This function returns a true value only if all of the values are integers, and you said that almost all of A and B are integers.
Try casting the complete column A to int and see if you get close to the answer.
I don't think you will ever get exactly the same answer as R, because R's percentile function most likely accepts non-integers as well.
One way to get the exact answer would be to write your own UDF and use it instead.
Hope this helps!

Related

How to sort for most negative values and most positive values across columns?

I am trying to create a new column in my dataframe based on the maximum values across 3 columns. However, depending on the values within each row, I want it to sort for either the most negative value or the most positive value. If the average for an individual row across the 3 columns is greater than 0, I want it to report the most positive value. If it is less than 0, I want it to report back the most negative value.
Here is an example of the dataframe
A B C
-0.30 -0.45 -0.25
0.25 0.43 0.21
-0.10 0.10 0.25
-0.30 -0.10 0.05
And here is the desired output
A B C D
-0.30 -0.45 -0.25 -0.45
0.25 0.43 0.21 0.43
-0.10 0.10 0.25 0.25
-0.30 -0.10 0.05 -0.30
I had first tried playing around with something like
data %>%
mutate(D = pmax(abs(A), abs(B), abs(C)))
But that just returns a column with the greatest absolute value in each row, where everything comes back positive.
Thanks in advance for your help, and apologies if the formatting of the question is off, I don't use this site a lot. Happy to clarify anything as well.

Why does coxph() combined with cluster() give much smaller standard errors than other methods to adjust for clustering (e.g. coxme() or frailty())?

I am working on a dataset to test the association between empirical antibiotics (variable emp, the antibiotics are cefuroxime or ceftriaxone compared with a reference antibiotic) and 30-day mortality (variable mort30). The data comes from patients admitted in 6 hospitals (variable site2) with a specific type of infection. Therefore, I would like to adjust for this clustering of patients on hospital level.
First I did this using the coxme() function for mixed models. However, based on visual inspection of the Schoenfeld residuals there were violations of the proportional hazards assumption and I tried adding a time transformation (tt) to the model. Unfortunately, the coxme() does not offer the possibility for time transformations.
Therefore, I tried other options to adjust for the clustering, including coxph() combined with frailty() and cluster(). Surprisingly, the standard errors I get using the cluster() option are much smaller than those from coxme() or frailty().
Does anyone know what the explanation for this is, and which option would provide the most reliable estimates?
1) Using coxme:
> uni.mort <- coxme(Surv(FUdur30, mort30num) ~ emp + (1 | site2), data = total.pop)
> summary(uni.mort)
Cox mixed-effects model fit by maximum likelihood
Data: total.pop
events, n = 58, 253
Iterations= 24 147
NULL Integrated Fitted
Log-likelihood -313.8427 -307.6543 -305.8967
Chisq df p AIC BIC
Integrated loglik 12.38 3.00 0.0061976 6.38 0.20
Penalized loglik 15.89 3.56 0.0021127 8.77 1.43
Model: Surv(FUdur30, mort30num) ~ emp + (1 | site2)
Fixed coefficients
coef exp(coef) se(coef) z p
empCefuroxime 0.5879058 1.800214 0.6070631 0.97 0.33
empCeftriaxone 1.3422317 3.827576 0.5231278 2.57 0.01
Random effects
Group Variable Std Dev Variance
site2 Intercept 0.2194737 0.0481687
> confint(uni.mort)
2.5 % 97.5 %
empCefuroxime -0.6019160 1.777728
empCeftriaxone 0.3169202 2.367543
2) Using frailty()
uni.mort <- coxph(Surv(FUdur30, mort30num) ~ emp + frailty(site2), data = total.pop)
> summary(uni.mort)
Call:
coxph(formula = Surv(FUdur30, mort30num) ~ emp + frailty(site2),
data = total.pop)
n= 253, number of events= 58
coef se(coef) se2 Chisq DF p
empCefuroxime 0.6302 0.6023 0.6010 1.09 1.0 0.3000
empCeftriaxone 1.3559 0.5221 0.5219 6.75 1.0 0.0094
frailty(site2) 0.40 0.3 0.2900
exp(coef) exp(-coef) lower .95 upper .95
empCefuroxime 1.878 0.5325 0.5768 6.114
empCeftriaxone 3.880 0.2577 1.3947 10.796
Iterations: 7 outer, 27 Newton-Raphson
Variance of random effect= 0.006858179 I-likelihood = -307.8
Degrees of freedom for terms= 2.0 0.3
Concordance= 0.655 (se = 0.035 )
Likelihood ratio test= 12.87 on 2.29 df, p=0.002
3) Using cluster()
uni.mort <- coxph(Surv(FUdur30, mort30num) ~ emp, cluster = site2, data = total.pop)
> summary(uni.mort)
Call:
coxph(formula = Surv(FUdur30, mort30num) ~ emp, data = total.pop,
cluster = site2)
n= 253, number of events= 58
coef exp(coef) se(coef) robust se z Pr(>|z|)
empCefuroxime 0.6405 1.8975 0.6009 0.3041 2.106 0.035209 *
empCeftriaxone 1.3594 3.8937 0.5218 0.3545 3.834 0.000126 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
empCefuroxime 1.897 0.5270 1.045 3.444
empCeftriaxone 3.894 0.2568 1.944 7.801
Concordance= 0.608 (se = 0.027 )
Likelihood ratio test= 12.08 on 2 df, p=0.002
Wald test = 15.38 on 2 df, p=5e-04
Score (logrank) test = 10.69 on 2 df, p=0.005, Robust = 5.99 p=0.05
(Note: the likelihood ratio and score tests assume independence of
observations within a cluster, the Wald and robust score tests do not).
>

What is the syntax in Oracle to round any number to the greatest/highest place value of that number?

I have a wide variety of numbers, in the ten thousands, thousands, hundreds, etc.
I would like to round each number to its highest place value, for example:
Starting #: 2555.5
Correctly rounded: 3000
More examples (in the same report):
Given: 255
Rounded: 300
Given: 25555
Rounded: 30000
Given: 2444
Rounded: 2000
But with the ROUND() or CEIL() functions I get the following:
Given: 2555.5
Did not want: 2556
Any ideas? Thank you in advance.
You can combine numeric functions like this:
SELECT
col,
ROUND(col / POWER(10,TRUNC(LOG(10, col)))) * POWER(10,TRUNC(LOG(10,col)))
FROM Data
See fiddle
Explanation:
LOG(10, number) gets the power you need to raise 10 to in order to get the number. E.g., LOG(10, 255) = 2.40654 and 10^2.40654 = 255.
TRUNC(LOG(10, col)) is the number of digits excluding the leading digit (2 for 255).
POWER(10, TRUNC(LOG(10, col))) converts, e.g., 255 to 100.
Then we divide the number by this power of ten. E.g., for 255 we get 255 / 100 = 2.55.
Then we round: ROUND(2.55) = 3.
Finally we multiply this rounded result again by the previous divisor: 3 * 100 = 300.
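As a quick sanity check (not part of the original answer), applying the same expression to the sample values from the question, with an inline view standing in for the Data table, reproduces the requested results:
SELECT col,
       ROUND(col / POWER(10, TRUNC(LOG(10, col)))) * POWER(10, TRUNC(LOG(10, col))) AS rounded
FROM (SELECT 2555.5 AS col FROM DUAL
      UNION ALL SELECT 255 FROM DUAL
      UNION ALL SELECT 25555 FROM DUAL
      UNION ALL SELECT 2444 FROM DUAL);
-- rounded: 3000, 300, 30000, 2000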
By using the Oracle ROUND function with a negative second parameter (the number of digits to round to), we can simplify the select command (see fiddle):
SELECT
col,
ROUND(col, -TRUNC(LOG(10, col))) AS rounded
FROM Data
You can also use this to round by other fractions like quarters of the main number:
ROUND(4 * col, -TRUNC(LOG(10, col))) / 4 AS quarters
see fiddle
Similar to what Olivier built, you can use a combination of functions to round the numbers as you need. Instead of using LOG, I used LENGTH to get the number of non-decimal digits.
WITH
nums (num)
AS
(SELECT 2555.5 FROM DUAL
UNION ALL
SELECT 255 FROM DUAL
UNION ALL
SELECT 25555 FROM DUAL
UNION ALL
SELECT 2444 FROM DUAL)
SELECT num,
ROUND (num, (LENGTH (TRUNC (num)) - 1) * -1) as rounded
FROM nums;
NUM ROUNDED
_________ __________
2555.5 3000
255 300
25555 30000
2444 2000

How to create a uniformly random matrix in Julia?

I want to get a matrix with uniformly random values sampled from [-1, 2]:
x= rand([-1,2],(3,3))
3x3 Array{Int64,2}:
-1 -1 -1
2 -1 -1
-1 -1 -1
But this only draws from the two values -1 and 2, and I'm looking for continuous values, for instance -0.9, 0.75, -0.09, 1.80.
How can I do that?
Note: I am assuming here that you're looking for uniform random variables.
You can also use the Distributions package:
## Pkg.add("Distributions") # If you don't already have it installed.
using Distributions
rand(Uniform(-1,2), 3,3)
I do quite like isebarn's solution though, as it gets you thinking about the actual properties of the underlying probability distributions.
For a random number in the range [a, b]:
rand() * (b-a) + a
and it works for a matrix as well:
rand(3,3) * (2 - (-1)) - 1
3x3 Array{Float64,2}:
1.85611 0.456955 -0.0219579
1.91196 -0.0352324 0.0296134
1.63924 -0.567682 0.45602
You can also use a FloatRange{Float64} with the desired step:
julia> rand(-1.0:0.01:2.0, 3, 3)
3x3 Array{Float64,2}:
0.79 1.73 0.95
0.73 1.4 -0.46
1.42 1.68 -0.55

Is there an algorithm that can divide a number into three parts and have their totals match the original number?

For example if you take the following example into consideration.
100.00 - Original Number
33.33 - 1st divided by 3
33.33 - 2nd divided by 3
33.33 - 3rd divided by 3
99.99 - Is the sum of the 3 division outcomes
But I want it to match the original 100.00.
One way I saw it could be done was by subtracting the first two divisions from the original number, making the result my third number. If I then take those 3 numbers, I get my original number.
100.00 - Original Number
33.33 - 1st divided by 3
33.33 - 2nd divided by 3
33.34 - 3rd number
100.00 - Which gives me my original number correctly. (33.33+33.33+33.34 = 100.00)
Is there a formula or function for this in Oracle PL/SQL, or something else that could be implemented?
Thanks in advance!
This version takes precision as a parameter as well:
with q as (select 100 as val, 3 as parts, 2 as prec from dual)
select rownum as no
,case when rownum = parts
then val - round(val / parts, prec) * (parts - 1)
else round(val / parts, prec)
end v
from q
connect by level <= parts
no v
=== =====
1 33.33
2 33.33
3 33.34
For example, if you want to split the value among the number of days in the current month, you can do this:
with q as (select 100 as val
,extract(day from last_day(sysdate)) as parts
,2 as prec from dual)
select rownum as no
,case when rownum = parts
then val - round(val / parts, prec) * (parts - 1)
else round(val / parts, prec)
end v
from q
connect by level <= parts;
1 3.33
2 3.33
3 3.33
4 3.33
...
27 3.33
28 3.33
29 3.33
30 3.43
To apportion the value amongst each month, weighted by the number of days in each month, you could do this instead (change the level <= 3 to change the number of months it is calculated for):
with q as (
select add_months(date '2013-07-01', rownum-1) the_month
,extract(day from last_day(add_months(date '2013-07-01', rownum-1)))
as days_in_month
,100 as val
,2 as prec
from dual
connect by level <= 3)
,q2 as (
select the_month, val, prec
,round(val * days_in_month
/ sum(days_in_month) over (), prec)
as apportioned
,row_number() over (order by the_month desc)
as reverse_rn
from q)
select the_month
,case when reverse_rn = 1
then val - sum(apportioned) over (order by the_month
rows between unbounded preceding and 1 preceding)
else apportioned
end as portion
from q2;
01/JUL/13 33.7
01/AUG/13 33.7
01/SEP/13 32.6
Use rational numbers. You could store the numbers as fractions rather than simple values. That's the only way to assure that the quantity is truly split in 3, and that it adds up to the original number. Sure you can do something hacky with rounding and remainders, as long as you don't care that the portions are not exactly split in 3.
The "algorithm" is simply that
100/3 + 100/3 + 100/3 == 300/3 == 100
Store both the numerator and the denominator in separate fields, then add the numerators. You can always convert to floating point when you display the values.
The Oracle docs even have a nice example of how to implement it:
CREATE TYPE rational_type AS OBJECT
( numerator INTEGER,
denominator INTEGER,
MAP MEMBER FUNCTION rat_to_real RETURN REAL,
MEMBER PROCEDURE normalize,
MEMBER FUNCTION plus (x rational_type)
RETURN rational_type);
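The docs show only the type specification. Purely as a hedged illustration (this body is not part of the Oracle example), a minimal implementation consistent with that spec could look like this:
CREATE OR REPLACE TYPE BODY rational_type AS
  MAP MEMBER FUNCTION rat_to_real RETURN REAL IS
  BEGIN
    -- convert to floating point only when displaying the value
    RETURN numerator / denominator;
  END rat_to_real;

  MEMBER PROCEDURE normalize IS
    a INTEGER := ABS(numerator);
    b INTEGER := ABS(denominator);
    t INTEGER;
  BEGIN
    -- reduce the fraction using Euclid's gcd
    WHILE b <> 0 LOOP
      t := MOD(a, b);
      a := b;
      b := t;
    END LOOP;
    IF a > 0 THEN
      numerator   := numerator / a;
      denominator := denominator / a;
    END IF;
  END normalize;

  MEMBER FUNCTION plus (x rational_type) RETURN rational_type IS
  BEGIN
    -- bring both fractions to a common denominator, then add the numerators
    RETURN rational_type(numerator * x.denominator + x.numerator * denominator,
                         denominator * x.denominator);
  END plus;
END;
/
Adding rational_type(100, 3) to itself twice this way yields 2700/27, which normalize reduces to 100/1, i.e. exactly the original 100.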
Here is a parameterized SQL version
SELECT COUNT (*), grp
FROM (WITH input AS (SELECT 100 p_number, 3 p_buckets FROM DUAL),
data
AS ( SELECT LEVEL id, (p_number / p_buckets) group_size
FROM input
CONNECT BY LEVEL <= p_number)
SELECT id, CEIL (ROW_NUMBER () OVER (ORDER BY id) / group_size) grp
FROM data)
GROUP BY grp
output:
COUNT(*) GRP
33 1
33 2
34 3
If you edit the input parameters (p_number and p_buckets) the SQL essentially distributes p_number as evenly as possible among the # of buckets requested (p_buckets).
I solved this problem yesterday by subtracting 2 of the 3 parts from the starting number, e.g. 100 - 33.33 - 33.33 = 33.34, and the sum of the parts is still 100.
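In Oracle SQL that approach can be written directly. A minimal sketch (reusing the question's example of 100 split into 3 parts at 2 decimal places) would be:
SELECT ROUND(100 / 3, 2)           AS part1,
       ROUND(100 / 3, 2)           AS part2,
       100 - 2 * ROUND(100 / 3, 2) AS part3
FROM DUAL;
-- part1 = 33.33, part2 = 33.33, part3 = 33.34, which sum back to 100.00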
