Hive TABLESAMPLE on clustered table - hadoop

What is the correct way to combine bucketing and TABLESAMPLE?
There is a table X which I created by
CREATE TABLE `X`(`action_id` string,`classifier` string)
CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
STORED AS ORC
Then I inserted 500M rows into X with
set hive.enforce.bucketing=true;
INSERT OVERWRITE TABLE X SELECT * FROM X_RAW
Then I want to count or search for rows matching a condition, roughly:
SELECT COUNT(*) FROM X WHERE action_id='aaa' AND classifier='bbb'
But since I clustered X by (action_id, classifier), I assumed it would be better to use TABLESAMPLE.
So the better query should be
SELECT COUNT(*) FROM X
TABLESAMPLE(BUCKET 1 OUT OF 256 ON action_id, classifier)
WHERE action_id='aaa' AND classifier='bbb'
Is there anything wrong with the above?
I cannot find any performance gain between these two queries.
Query 1 and result (without TABLESAMPLE):
SELECT COUNT(*) from X
WHERE action_id='aaa' and classifier='bbb'
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 256 256 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 15.35 s
--------------------------------------------------------------------------------
It scans the full data.
Query 2 and result:
SELECT COUNT(*) from X
TABLESAMPLE(BUCKET 1 OUT OF 256 ON action_id, classifier)
WHERE action_id='aaa' and classifier='bbb'
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 256 256 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 15.82 s
--------------------------------------------------------------------------------
It ALSO scans the full data.
What I expected from query 2 is something like this (only 1 map task, and relatively faster than without TABLESAMPLE):
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 1 1 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 3.xx s
--------------------------------------------------------------------------------
Values of action_id and classifier are well distributed and the data is not skewed.
So what is the correct query to target only 1 bucket and use only 1 map task?


Dax measure- sum of percent of total by group with condition

For simplicity's sake, I have the following dummy data:
id val
1 5
1 30
1 50
1 15
2 120
2 60
2 10
2 10
My desired output is the following:
id SUM_GT_10%
1 95%
2 90%
SUM_GT_10% can be obtained by the following steps:
1. Calculate the sum of val for each id.
2. Divide val by the result of step 1.
3. Sum the results of step 2 where the result of step 2 > 10%.
Using the example data, the sum of val is 100 for id 1 and 200 for id 2, so we would obtain the following additional columns (the results of steps 1 and 2):
id val step1 step2
1 5 100 5%
1 30 100 30%
1 50 100 50%
1 15 100 15%
2 120 200 60%
2 60 200 30%
2 10 200 5%
2 10 200 5%
And our final output (step 3) would be the sum of the step 2 values where step 2 > 10%:
id SUM_GT_10%
1 95%
2 90%
I don't care about the intermediate columns, just the final output, of course.
James, you might want to create a temporary table in your measure and then sum its results:
tbl_SumVAL =
var ThisId = MAX(tbl_VAL[id])
var temp =
    FILTER(
        SELECTCOLUMNS(
            tbl_VAL,
            "id", tbl_VAL[id],
            "GT%", tbl_VAL[val] / SUMX(FILTER(tbl_VAL, tbl_VAL[id] = ThisId), tbl_VAL[val])
        ),
        [GT%] > 0.1
    )
return
    SUMX(temp, [GT%])
The temp table is basically recreating two steps that you have described (divide "original" value by the sum of all values for each ID), and then leaving only those values that are greater than 0.1. Note that if your id is not a number, then you'd need to replace MAX(tbl_VAL[id]) with SELECTEDVALUE(tbl_VAL[id]).
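To sanity-check the arithmetic outside of DAX, the same three steps can be sketched in plain Python. This is only an illustration: the function name sum_share_above and the list-of-tuples layout are assumptions made for the example, not part of the original model.

```python
from collections import defaultdict

def sum_share_above(rows, threshold=0.1):
    """For each id, sum the shares val / total(id) that exceed `threshold`."""
    totals = defaultdict(float)
    for id_, val in rows:          # step 1: total of val per id
        totals[id_] += val
    result = defaultdict(float)
    for id_, val in rows:
        share = val / totals[id_]  # step 2: share of the id's total
        if share > threshold:      # step 3: keep only shares above the threshold
            result[id_] += share
    return dict(result)

# The dummy data from the question, as (id, val) pairs.
rows = [(1, 5), (1, 30), (1, 50), (1, 15),
        (2, 120), (2, 60), (2, 10), (2, 10)]
shares = sum_share_above(rows)
print({k: round(v, 2) for k, v in shares.items()})
```

For the dummy data this gives roughly 0.95 for id 1 and 0.90 for id 2, matching the desired output in the question.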
Also, make sure to set your id field to "Don't summarize" in case id is a number.

H2o: Is there a way to fix threshold in H2ORandomForestEstimator performance during training and testing?

I have built a model with H2ORandomForestEstimator and the results look like the output below.
The threshold keeps changing (0.5 on training and 0.313725489027 on validation), and I would like to fix the threshold in H2ORandomForestEstimator so that I can compare results during fine-tuning. Is there a way to set the threshold?
According to http://h2o-release.s3.amazonaws.com/h2o/master/3484/docs-website/h2o-py/docs/modeling.html#h2orandomforestestimator, there is no such parameter.
If there is no way to set this, how do we know what threshold our model is built on?
rf_v1
** Reported on train data. **
MSE: 2.75013548238e-05
RMSE: 0.00524417341664
LogLoss: 0.000494320913199
Mean Per-Class Error: 0.0188802936476
AUC: 0.974221763605
Gini: 0.948443527211
Confusion Matrix (Act/Pred) for max f1 # threshold = 0.5:
0 1 Error Rate
----- ------ --- ------- --------------
0 161692 1 0 (1.0/161693.0)
1 3 50 0.0566 (3.0/53.0)
Total 161695 51 0 (4.0/161746.0)
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
--------------------------- ----------- -------- -----
max f1 0.5 0.961538 19
max f2 0.25 0.955056 21
max f0point5 0.571429 0.983936 18
max accuracy 0.571429 0.999975 18
max precision 1 1 0
max recall 0 1 69
max specificity 1 1 0
max absolute_mcc 0.5 0.961704 19
max min_per_class_accuracy 0.25 0.962264 21
max mean_per_class_accuracy 0.25 0.98112 21
Gains/Lift Table: Avg response rate: 0.03 %
** Reported on validation data. **
MSE: 1.00535766226e-05
RMSE: 0.00317073755183
LogLoss: 4.53885183426e-05
Mean Per-Class Error: 0.0
AUC: 1.0
Gini: 1.0
Confusion Matrix (Act/Pred) for max f1 # threshold = 0.313725489027:
0 1 Error Rate
----- ----- --- ------- -------------
0 53715 0 0 (0.0/53715.0)
1 0 16 0 (0.0/16.0)
Total 53715 16 0 (0.0/53731.0)
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
--------------------------- ----------- ------- -----
max f1 0.313725 1 5
max f2 0.313725 1 5
max f0point5 0.313725 1 5
max accuracy 0.313725 1 5
max precision 1 1 0
max recall 0.313725 1 5
max specificity 1 1 0
max absolute_mcc 0.313725 1 5
max min_per_class_accuracy 0.313725 1 5
max mean_per_class_accuracy 0.313725 1 5
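The threshold shown is the one that maximizes F1 on the data being reported, which is why it differs between training and validation.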
If you want to apply your own threshold, you will have to take the probability of the positive class and compare it yourself to produce the label you want.
If you use your web browser to connect to the H2O Flow Web UI inside of H2O-3, you can mouse over the ROC curve and visually browse the confusion matrix for each threshold, which is convenient.
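A minimal sketch of applying your own fixed threshold by hand, written in plain Python so it runs without an H2O cluster. The probabilities and labels below are made-up illustration data; with a real model they would come from the positive-class probability column of the prediction output (typically named p1 for a 0/1 response). The helper names are hypothetical, and using >= rather than > at the boundary is a choice made for this example.

```python
def apply_threshold(probs, threshold):
    """Label an observation 1 when its positive-class probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probs]

def confusion_counts(actual, predicted):
    """Return (tn, fp, fn, tp) for 0/1 labels."""
    tn = fp = fn = tp = 0
    for a, p in zip(actual, predicted):
        if a == 0 and p == 0:
            tn += 1
        elif a == 0 and p == 1:
            fp += 1
        elif a == 1 and p == 0:
            fn += 1
        else:
            tp += 1
    return tn, fp, fn, tp

# Made-up positive-class probabilities and true labels (illustration only).
probs = [0.02, 0.31, 0.57, 0.90, 0.25, 0.49]
actual = [0, 0, 1, 1, 1, 0]

FIXED_THRESHOLD = 0.5  # the same value is reused for every dataset
predicted = apply_threshold(probs, FIXED_THRESHOLD)
counts = confusion_counts(actual, predicted)
print(predicted, counts)
```

Using the same FIXED_THRESHOLD for training and validation predictions keeps the confusion matrices comparable across tuning runs.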

How to optimize search of rows x columns combination in a matrix?

Given a matrix of 1's and 0's, I want to find a combination of rows and columns with the fewest (ideally no) 0's, maximizing the n_of_rows * n_of_columns picked.
For example, rows (0,1,2) and columns (0,1,3) have only one zero, in col #0 row #1, and the remaining 8 values are 1's.
1 1 0 1 0
0 1 1 1 0
1 1 0 1 1
0 0 1 0 0
The practical task is to search over thousands to millions of rows and columns, finding the maximal biclique in a bipartite graph: rows and columns can be viewed as vertices, and values as connections.
The problem is NP-complete, as far as I have learned.
Please advise an approach or algorithm that would speed up the task and reduce the CPU and memory requirements.
Not sure you could minimise this.
However, an easy way to work this out would be...
Multiply your matrix by a column vector full of 1's (one entry per matrix column). This will give you the number of 1's in each row. Next, multiply a row vector full of 1's (one entry per matrix row) in front of your matrix. This will give you the total of 1's for each column. From there it's a pretty easy comparison.
i.e. original matrix...
1 0 1
0 1 1
0 0 0
do
1 0 1     1     2
0 1 1  x  1  =  2   (row totals)
0 0 0     1     0
do
           1 0 1
1 1 1  x   0 1 1  =  1 1 2   (column totals)
           0 0 0
N.B. the max sum is 2 (which you would keep track of as you work it out).
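The two multiplications by vectors of 1's can be sketched in plain Python as follows. The function name row_and_column_totals is made up for this illustration, and the matrix is the small 3x3 example from this answer.

```python
def row_and_column_totals(matrix):
    """Count the 1's per row and per column via products with all-ones vectors."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    ones_col = [1] * n_cols   # column vector: one entry per matrix column
    ones_row = [1] * n_rows   # row vector: one entry per matrix row
    # matrix x ones_col -> one total per row
    row_totals = [sum(v * o for v, o in zip(row, ones_col)) for row in matrix]
    # ones_row x matrix -> one total per column
    col_totals = [sum(o * row[j] for o, row in zip(ones_row, matrix))
                  for j in range(n_cols)]
    return row_totals, col_totals

matrix = [[1, 0, 1],
          [0, 1, 1],
          [0, 0, 0]]
row_totals, col_totals = row_and_column_totals(matrix)
print(row_totals, col_totals)
```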
Actually given the following assumptions:
1. You don't care how many 0's are in each row or column
2. You don't need to keep track of their order....
Then you only really need to store values to count the total in each row/column as you read the values in and don't actually store the matrix itself.
If you are given the number of rows and columns prior to reading in the matrix you can do the following heuristics to reduce computational time...
Keep track of the current max. If the current row cannot reach this max, stop counting for the row (but continue with the columns). The same applies, vice versa, to the columns.
But you still have a worst-case scenario in which all rows and columns have the same number of 1's and 0's.... :)
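A rough sketch of the "count as you read" idea: only the running totals are kept, never the matrix itself. It assumes whitespace-separated 0/1 values, one matrix row per line; the function name streaming_totals is hypothetical, and the early-stopping heuristic is left out to keep the example short.

```python
def streaming_totals(lines):
    """Accumulate row and column counts of 1's in one pass over the input lines."""
    row_totals = []
    col_totals = []
    for line in lines:
        values = [int(tok) for tok in line.split()]
        if not col_totals:
            col_totals = [0] * len(values)  # sized from the first row
        row_totals.append(sum(values))      # total for this row
        for j, v in enumerate(values):      # update the running column totals
            col_totals[j] += v
    return row_totals, col_totals

# The 4x5 matrix from the question, fed line by line.
lines = ["1 1 0 1 0",
         "0 1 1 1 0",
         "1 1 0 1 1",
         "0 0 1 0 0"]
stream_rows, stream_cols = streaming_totals(lines)
print(stream_rows, stream_cols)
```

Memory use is proportional to rows plus columns rather than rows times columns, which matters at the sizes mentioned in the question.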

Create a Matrix with conditions

I am struggling to create two matrices in SAS based on certain conditions.
I am trying to create a 12x12 matrix in the format below:
col1 col2 col3 col4 ............col12
1 0 0 0 ............ 0
1 1 0 0 ............ 0
1 1 1
0 1 1
0 0 1
1 0 0
1 1 0
1 1 1
0 1 1
0 0 1
0 0 0
0 0 0
and so on.
and this:
col1 col2 col3 col4 ............col12
1 0 0 0 ............ 0
1 2 0 0 ............ 0
1 2 3
0 2 3
0 0 3
1 0 0
1 2 0
1 2 3
0 2 3
0 0 3
0 0 0
0 0 0
and so on. Basically, it displays the column number instead of 1's.
I read a couple of articles online and tried PROC IML, but I got an error that the procedure doesn't exist.
I tried the code below to start with, but got nowhere. I am confused about how I should enter the conditions.
data test_matrices ;
array col(12) col1-col12;
do i=1 to 12;
j=i-1;
col(i)=ifn(i le 5 , 1, 0,0);
output;
end;
run;
Please help.
Thanks.
Jay
What you need to start with:
Arrays have to have a name, and unless they're temporary arrays they also need variable names (they'll take the name concatenated with the array index if you don't provide it). So:
array (*) 1-12;
needs to be
array myVars(12) col1-col12;
You need two nested loops: one to define your 'rows' and one to work on your columns. I.e., for row 1, do something 12 times; for row 2, do something 12 times.
So this:
do i=1 to 12;
do j=1 to 12;
... do stuff ...
end;
output; *you had this right! It goes in the outer loop since it defines rows.;
end;
Now you have something that lets you work on just one cell. So you're on cell (i,j); what rule defines what should go there? Figure out that logic, and then set myvars[j] to that value. You can't index anything by 'i'; it only determines how many rows you output.
I.e., this:
myvars[j] = i;
That's not correct, but figure out what is correct and assign that to myvars[j].

How to substitute a for-loop with vectorization acting several thousand times per data.frame row?

Being still quite wet behind the ears concerning R and, more importantly, vectorization, I cannot get my head around how to speed up the code below.
The for-loop calculates the number of seeds falling onto a road for several road segments with different densities of seed-generating plants, by applying a random probability to every seed.
As my real data frame has ~200k rows and seed numbers are up to 300k/segment, using the example below would take several hours on my current machine.
#Example data.frame
df <- data.frame(Density=c(0,0,0,3,0,120,300,120,0,0))
#Example SeedRain vector
SeedRainDists <- c(7.72,-43.11,16.80,-9.04,1.22,0.70,16.48,75.06,42.64,-5.50)
#Calculating the number of seeds from plant densities
df$Seeds <- df$Density * 500
#Applying a probability of reaching the road for every seed
df$SeedsOnRoad <- apply(as.matrix(df$Seeds),1,function(x){
SeedsOut <- 0
if(x>0){
#Summing up the number of seeds reaching a certain distance
for(i in 1:x){
SeedsOut <- SeedsOut +
ifelse(sample(SeedRainDists,1,replace=T)>40,1,0)
}
}
return(SeedsOut)
})
If someone might give me a hint as to how the loop could be substituted by vectorization - or maybe how the data could be organized better in the first place to improve performance - I would be very grateful!
Edit: Roland's answer showed that I may have oversimplified the question. In the for-loop I extract a random value from a distribution of distances recorded by another author (that's why I can't supply the data here). Added an exemplary vector with likely values for SeedRain distances.
This should do about the same simulation:
df$SeedsOnRoad2 <- sapply(df$Seeds,function(x){
rbinom(1,x,0.6)
})
# Density Seeds SeedsOnRoad SeedsOnRoad2
#1 0 0 0 0
#2 0 0 0 0
#3 0 0 0 0
#4 3 1500 892 877
#5 0 0 0 0
#6 120 60000 36048 36158
#7 300 150000 90031 89875
#8 120 60000 35985 35773
#9 0 0 0 0
#10 0 0 0 0
One option is to generate the sample() for all Seeds per row of df in a single go.
Using set.seed(1) before your loop-based code I get:
> df
Density Seeds SeedsOnRoad
1 0 0 0
2 0 0 0
3 0 0 0
4 3 1500 289
5 0 0 0
6 120 60000 12044
7 300 150000 29984
8 120 60000 12079
9 0 0 0
10 0 0 0
I get the same answer in a fraction of the time if I do:
set.seed(1)
tmp <- sapply(df$Seeds,
function(x) sum(sample(SeedRainDists, x, replace = TRUE) > 40))
> tmp
[1] 0 0 0 289 0 12044 29984 12079 0 0
For comparison:
df <- transform(df, GavSeedsOnRoad = tmp)
df
> df
Density Seeds SeedsOnRoad GavSeedsOnRoad
1 0 0 0 0
2 0 0 0 0
3 0 0 0 0
4 3 1500 289 289
5 0 0 0 0
6 120 60000 12044 12044
7 300 150000 29984 29984
8 120 60000 12079 12079
9 0 0 0 0
10 0 0 0 0
The points to note here are:
Try to avoid calling a function repeatedly in a loop if the function is vectorised or can generate the entire end result with a single call. Here you were calling sample() Seeds times for each row of df, each call returning a single sample from SeedRainDists. I instead do a single sample() call per row of df, asking for a sample of size Seeds. Hence I call sample() 10 times, whereas your code called it 271500 times.
Even if you have to call a function repeatedly in a loop, remove from the loop anything vectorised that could be done on the entire result after the loop is done. An example here is your accumulation of SeedsOut, which calls +() a large number of times.
It would have been better to collect each result in a vector and then sum() that vector outside the loop, e.g.:
SeedsOut <- numeric(length = x)
for(i in seq_len(x)) {
SeedsOut[i] <- ifelse(sample(SeedRainDists,1,replace=TRUE)>40,1,0)
}
sum(SeedsOut)
Note that R treats a logical as if it were numeric 0s or 1s where used in any mathematical function. Hence
sum(ifelse(sample(SeedRainDists, 100, replace=TRUE)>40,1,0))
and
sum(sample(SeedRainDists, 100, replace=TRUE)>40)
would give the same result if run with the same set.seed().
There may be a fancier way of doing the sampling that requires fewer calls to sample(). (There is: sample(SeedRainDists, sum(Seeds), replace = TRUE) > 40, but then you need to take care of selecting the right elements of that vector for each row of df, which is not hard, just slightly cumbersome.) But what I show may be efficient enough.
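The "one big draw, then pick the right elements per row" idea can be sketched like this. It is written in Python rather than R purely as an illustration; the function name seeds_on_road is hypothetical, and random.choices stands in for sample(..., replace = TRUE). Because the random number generator differs, the counts will not match the R output above.

```python
import random
from itertools import accumulate

def seeds_on_road(seeds_per_row, dists, cutoff=40, rng=None):
    """Draw all distances in one call, then count hits within each row's slice."""
    if rng is None:
        rng = random.Random()
    total = sum(seeds_per_row)
    # One draw with replacement for every seed of every row.
    hits = [d > cutoff for d in rng.choices(dists, k=total)]
    # Cumulative offsets mark where each row's seeds start and end.
    ends = list(accumulate(seeds_per_row))
    starts = [0] + ends[:-1]
    return [sum(hits[s:e]) for s, e in zip(starts, ends)]

density = [0, 0, 0, 3, 0, 120, 300, 120, 0, 0]
seeds = [d * 500 for d in density]
seed_rain_dists = [7.72, -43.11, 16.80, -9.04, 1.22, 0.70, 16.48, 75.06, 42.64, -5.50]

on_road = seeds_on_road(seeds, seed_rain_dists, rng=random.Random(1))
print(on_road)
```

Rows with zero seeds get an empty slice and therefore a count of 0, so no special case is needed for them.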
