I would like to obtain the first quartile of values from a column (speed) of data in table totalSpeeds.
To do this, I tried creating a variable (threshold), then selected values that were less than or equal to it.
SET threshold = (SELECT 0.25*MAX(speed) FROM totalSpeeds);
SELECT speed FROM totalSpeeds WHERE speed <= ${hiveconf:threshold};
This failed and returned a parse error. Is there a more efficient way of obtaining the upper-bound of the first quartile of speeds? Or is there a way of tweaking the above commands to return the first-quartile speeds?
Thanks in advance,
Anita
There is a built-in UDF in Hive for calculating percentiles. Use:
select percentile(speed, .25) from totalSpeeds;
Explanation of the UDF:
Returns the exact pth percentile of a column in the group. p must be between 0 and 1.
Similarly, you can extract multiple percentiles at once using percentile(speed, array(p1, p2)).
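For example, a quick sketch against the totalSpeeds table from your question, pulling the 25th, 50th, and 75th percentiles in one query:
-- quartile boundaries of speed in a single pass
-- note: percentile() expects an integral column; if speed is stored as DOUBLE,
-- you may need percentile_approx(speed, array(...)) instead
SELECT percentile(speed, array(0.25, 0.5, 0.75)) AS quartiles
FROM totalSpeeds;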
Related
I have an Excel sheet where I'm calculating the average completion of my Scrum tasks. I also have a Story Point (SP) column in the sheet. My calculation is:
Result = SP * percentage of completion --> this calculation is done for each row, and then I sum up all the results to get the overall figure.
But sometimes I add a new task, and for each new task I have to add its calculation to the average result by hand.
Is there any way to use a for loop in Excel?
for(int i=0;i<50;i++){ if(SP!=null && task!=null)(B+i)*(L+i)}
My calculation is like below:
AVERAGE((B4*L4+B5*L5+B6*L6+B7*L7+B8*L8+B9*L9+B10*L10)/SUM(B4:B10))
First of all, AVERAGE is not doing anything in your formula, since the argument you pass to it is just one single value. You already do an average calculation by dividing by the sum. That average is in fact a weighted average, and so you could not even achieve that with a plain AVERAGE function.
I see several ways to make this formula more generic, so it keeps working when you add rows:
1. Use SUMPRODUCT
=SUMPRODUCT(B4:B100,L4:L100)/SUM(B4:B100)
The row number 100 is chosen arbitrarily, but should evidently encompass all data rows. If no data occurs below your table, it is safe to add a large margin. You'll want to avoid the situation where you think you are adding a line to the table but actually end up outside the range of the formula. Using proper Excel tables can help avoid this situation.
2. Use an array formula
This would be a second resort for when the formula becomes more complicated and cannot be executed with a "simple" SUMPRODUCT. But the above would translate to this array formula:
=SUM(B4:B100*L4:L100)/SUM(B4:B100)
Once you have typed this in the formula bar, make sure to press Ctrl+Shift+Enter to enter it. Only then will it act as an array formula.
Again, the same remark about row number 100.
3. Use an extra column
Things get easy when you use an extra column for storing the product of B & L values for each row. So you would put in cell N4 the following formula:
=B4*L4
...and then copy that relative formula to the other rows. You can hide that column if you want.
Then the overall formula can be:
=SUM(N4:N100)/SUM(B4:B100)
With this solution you must take care to always copy an existing row when inserting a new one, so that the N column contains the intermediate product formula for the new row as well.
I'm calculating cumulative total in DAX like:
DEFINE MEASURE 'Sales'[Running Total] =
    CALCULATE (
        SUM ( 'Sales'[Revenue] ),
        FILTER ( ALL ( 'Date'[Date] ), 'Date'[Date] <= MAX ( 'Date'[Date] ) )
    )
This should be a well-established pattern (at least it is referenced here: http://www.daxpatterns.com/cumulative-total/).
My problem is when I try to evaluate it like:
EVALUATE
SUMMARIZECOLUMNS (
    'Date'[Date],
    "Total_Revenue_By_Date", 'Sales'[Running Total]
)
I'm running into error
The resultset of a query to external data source has exceeded the maximum allowed size of '1000000' rows.
I'm using a tabular model with DirectQuery. I know I can raise the limit; however, the underlying tables are small - the Date table has around 10,000 rows and the Sales table has around 10,000 rows as well (it will be much larger in production) - so something here doesn't scale well.
I have an idea of how to work around this by calculating the running total at the SQL level; any idea how to tackle it at the DAX level?
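For reference, a minimal sketch of the SQL-level fallback mentioned above, assuming the source database has a Sales table with SaleDate and Revenue columns (those names are illustrative, not taken from the model):
-- cumulative revenue per date, computed in the source database instead of in DAX
SELECT
    SaleDate,
    SUM(SUM(Revenue)) OVER (
        ORDER BY SaleDate
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS RunningTotal
FROM Sales
GROUP BY SaleDate;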
Models created by Power BI Desktop have a default limit of 1 million rows.
This might help you,
https://www.sqlbi.com/articles/tuning-query-limits-for-directquery/
I have a huge table (more than 1 billion rows) in Impala. I need to sample ~100,000 rows several times. What is the best way to query sample rows?
As Jeff mentioned, what you've asked for exactly isn't possible yet, but we do have an internal aggregate function which takes 200,000 samples (using reservoir sampling) and returns the samples, comma-delimited as a single row. There is no way to change the number of samples yet. If there are fewer than 200,000 rows, all will be returned. If you're interested in how this works, see the implementation of the aggregate function and reservoir sampling structures.
There isn't a way to 'split' or explode the results yet, either, so I don't know how helpful this will be.
For example, sampling trivially from a table with 8 rows:
> select sample(id) from functional.alltypestiny
+------------------------+
| sample(id) |
+------------------------+
| 0, 1, 2, 3, 4, 5, 6, 7 |
+------------------------+
Fetched 1 row(s) in 4.05s
(For context: this was added in a past release to support histogram statistics in the planner, which unfortunately isn't ready yet.)
Impala does not currently support TABLESAMPLE, unfortunately. See https://issues.cloudera.org/browse/IMPALA-1924 to follow its development.
In retrospect, knowing that TABLESAMPLE is unavailable, one could add a field RVAL (a random 32-bit integer, for instance) to each record and sample repeatedly by adding "WHERE RVAL > x AND RVAL < y" for appropriate values of x and y. Non-overlapping intervals [x1, y1], [x2, y2], ... will be independent. You can also select using "WHERE RVAL % 10000 = 1", "= 2", etc. to obtain separate, independent subsets.
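A rough sketch of that idea (table and column names are just placeholders):
-- one-time: materialize a copy of the table with a random 32-bit key
CREATE TABLE huge_table_rval AS
SELECT t.*, CAST(rand() * 4294967296 AS BIGINT) AS rval
FROM huge_table t;

-- repeated, independent samples: use disjoint rval ranges each time
-- (a range of ~430,000 out of 2^32 keeps roughly 100,000 of 1 billion rows)
SELECT * FROM huge_table_rval WHERE rval >= 0      AND rval < 430000;
SELECT * FROM huge_table_rval WHERE rval >= 430000 AND rval < 860000;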
TABLESAMPLE, mentioned in other answers, is now available in newer versions of Impala (>= 2.9.0); see the documentation.
Here's an example of how you could use it to sample 1% of your data:
SELECT foo FROM huge_table TABLESAMPLE SYSTEM(1)
or
SELECT bar FROM huge_table TABLESAMPLE SYSTEM(1) WHERE name='john'
It looks like the percentage argument must be an integer, so the smallest sample you can take is limited to 1%.
Keep in mind that the proportion of sampled data from the table is not guaranteed and may be greater than the specified percentage (in this case more than 1%). This is explained in greater detail in Impala's documentation.
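If you need roughly a fixed number of rows (the ~100,000 from the question) rather than a percentage, one possible workaround - an assumption on my part, not something from the documentation - is to over-sample slightly and then cap the result:
-- take a 1% block sample, shuffle it, and keep only 100,000 rows
SELECT foo
FROM huge_table TABLESAMPLE SYSTEM(1)
ORDER BY random()
LIMIT 100000;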
If you are looking to sample over certain column(s), you can use the approach below.
Say you have global data and you want to randomly pick 10% of it to create your dataset. You can partition by any combination of columns too - like city, zip code, and state.
select * from
(
  select
    row_number() over (partition by country order by country, random()) rn,
    count(*) over (partition by country) cntpartition,
    tab.*
  from dat.mytable tab
) rs
where rs.rn between 1 and rs.cntpartition * 10 / 100  -- this is for 10% of the data
Link: Randomly sampling n rows in impala using random() or tablesample system()
I am trying to use Pig's RANK operator to assign an integer number to each given string. Although it works when I set the PARALLEL clause to 1, it doesn't with a higher value (like 200). I need to use multiple reducers to speed up the processing since, by default, Pig uses only one reducer, which takes a long time.
My query is as follows:
rank = rank tupl1 by col1 ASC parallel 200;
Actually, according to the Pig documentation (https://pig.apache.org/docs/r0.11.1/perf.html#parallel):
You can include the PARALLEL clause with any operator that starts a
reduce phase: COGROUP, CROSS, DISTINCT, GROUP, JOIN (inner), JOIN
(outer), and ORDER BY.
That's why you get an error, I think: it's not possible to set the PARALLEL clause for RANK.
I need to select the top x% of rows of a table in Pig. Could someone tell me how to do it without writing a UDF?
Thanks!
As mentioned before, first you need to count the number of rows in your table and then obviously you can do:
A = load 'X' as (row);
B = group A all;
C = foreach B generate COUNT(A) as count;
D = LIMIT A C.count/10; --you might need a cast to integer here
The catch is that dynamic argument support for the LIMIT operator was introduced in Pig 0.10. If you're working with an earlier version, a suggested alternative is to use the TOP function.
Not sure how you would go about pulling a percentage, but if you know your table has 100 rows, you can use the LIMIT operator to get the top 10%, for example:
A = load 'myfile' as (t, u, v);
B = order A by t;
C = limit B 10;
(Above example adapted from http://pig.apache.org/docs/r0.7.0/cookbook.html#Use+the+LIMIT+Operator)
As for dynamically limiting to 10%, I'm not sure you can do this without knowing how 'big' the table is, and I'm pretty sure you couldn't do it in a UDF; you'd need to run a job to count the number of rows, then another job to do the LIMIT query.
I won't write the Pig code as it would take a while to write and test, but I would do it like this (this is for an exact solution; if you don't need exactness, there are simpler methods):
1. Get a sample from your input - say, a few thousand data points or so.
2. Sort this and find the n quantiles, where n should be somewhere on the order of the number of reducers you have, or somewhat larger.
3. Count the data points in each quantile.
4. At this point the minimum of the top 10% will fall into one of these intervals. Find this interval (this is easy, as the counts tell you exactly where it is), and, using the sum of the counts of the larger quantiles together with the count of the relevant quantile, find the 10% point within this interval.
5. Go over your data again and filter out everything but the points larger than the one you just found.
Portions of this might require UDFs.