Percentiles for multiple columns - Hadoop

I have a table with around 200-250 columns and I want to compute the percentile for each of these columns.
Hive provides the function percentile(int_exp, p), which returns the pth percentile value of the column int_exp. But it seems redundant to run the same query for the rest of the 250 columns. Is there a way I can find the percentiles of all the columns in one go?

Unfortunately, you will have to call the percentile function for each column. One suggestion is to generate this query dynamically using some other language (e.g. Java, Ruby, or Python).
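For example, a short script can build the full SELECT once and print it. A minimal sketch in Python follows; the table name my_table and the column list are hypothetical, so substitute your own ~250 columns:

# Minimal sketch: build one Hive query that computes the median
# (p = 0.5) for every column. Table and column names are hypothetical.
columns = ["col1", "col2", "col3"]  # replace with your real column list

select_list = ",\n  ".join(
    "percentile({c}, 0.5) AS {c}_p50".format(c=c) for c in columns
)
query = "SELECT\n  {0}\nFROM my_table;".format(select_list)
print(query)

# Prints:
# SELECT
#   percentile(col1, 0.5) AS col1_p50,
#   percentile(col2, 0.5) AS col2_p50,
#   percentile(col3, 0.5) AS col3_p50
# FROM my_table;

You can then paste the generated query into Hive, or submit it from the same script.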

Related

Writing a formula in a Google Sheets cell that averages values from one column, selected by values in multiple other columns

I'm an average user of Google Sheets, and I've tried writing/looking up the formula I'm going for, but I haven't had any luck yet.
I have a spreadsheet with many values, and I need a single cell to display the average of a certain set of values in one column, selected according to values in multiple other columns.
The flow of information would look something along the lines of:
if value in Column D = L
then
    if value in Column J < $1.20
    then
        find the average of all values in Column N
I'd need the formula to narrow its field of data each time, so that the final result is the average of all the values in Column N whose rows have a value in Column J < $1.20 and a value in Column D = L.
I feel like a dummy over here because I just can't work out how to write this flow and get it to work right without adding multiple extra hidden columns. Can anyone help with this one?
I've tried writing the formula multiple different ways, but I haven't kept what I wrote down to pass on.
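A single AVERAGEIFS call can express that flow without extra hidden columns. A minimal sketch, assuming the data really sits in columns D, J and N and that the dollar amounts in J are plain numbers:

=AVERAGEIFS(N:N, D:D, "L", J:J, "<1.2")

AVERAGEIFS takes the range to average first, then pairs of criteria ranges and criteria; a row is included only when every condition holds, which matches the nested "if" flow above.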

PowerBI groupby with filters

My company has tasked me with slicing the information on turnover and creating different graphs.
My source data looks like this: Relevant columns are: Voluntary/Involuntary, Termination Reason, Country, Production, and TermDateKey
I am trying to get counts using different filters on the data. I managed to get the basic monthly total using the formula:
Term Month Count = GROUPBY('Turnover Source','Turnover Source'[TermDateKey],"Turnover Total Count", COUNTX(CURRENTGROUP(),'Turnover Source'[TermDateKey]))
This gave me a new sheet with the counts for each month.
Table that shows TermDateKey on Column 1, and Counts on column 2
I am trying to add onto this table by adding counts but using different filters.
For example, I am trying to add another column that gives me the monthly count filtered for 'Turnover Source'[Voluntary/Involuntary] == "Voluntary", then another column for 'Turnover Source'[Voluntary/Involuntary] == "Involuntary", and so on. I have not found anything that shows how to do this, and when I add in the FILTER function I get an error saying that GROUPBY(...) can only work on CURRENTGROUP().
Can someone point me to a resource that will give me the solution I need? I am at a loss; thank you all.
It looks like you may not be aware that you don't have to calculate all possible groupings with DAX formulas.
The very nature of Power BI is that you use a column like "Termination Reason" on an X-axis or in the legend of a visual. Any measure that you have created on the values of another column, e.g. a count of all rows, will then automatically be grouped by the values in "Termination Reason", giving you a count for each of the values in that column.
You do NOT need DAX functions to calculate the grouping values for each measure for each column value combination.
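The only DAX you typically need is a base measure. A minimal sketch, assuming your table really is named 'Turnover Source':

Turnover Total Count = COUNTROWS('Turnover Source')

Put TermDateKey on the axis and Voluntary/Involuntary in the legend, and this measure is grouped and filtered for you. If you still want explicit filtered measures, the usual pattern is CALCULATE rather than GROUPBY with FILTER, e.g.:

Voluntary Count = CALCULATE([Turnover Total Count], 'Turnover Source'[Voluntary/Involuntary] = "Voluntary")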
Here is some simple sample data that has been grouped into dates and colours, one chart showing a count of each colour and one chart showing a sum of the Value column. No DAX was written for that.
If your scenario is different, please explain.

Scan on DynamoDB table or query on a global secondary index or a local index (what's the best solution?)

I have an AWS DynamoDB table called "Users", whose hash/primary key is "UserID", which consists of emails. It has two attributes, the first called "Daily Points" and the second "TimeSpendInTheApp". Now I need to run a query or scan on the table that will give me the top 50 users with the highest points and the top 50 users who have spent the most time in the app. This query will be executed only once a day by a cron AWS Lambda. I am trying to find the best solution for this query or scan. For me, cost is more important than speed or efficiency. Maintaining a global secondary index or a local index on points can be a costly operation, as I have to assign read and write units to those indexes, which I want to avoid. The "Users" table will have a maximum of 100,000 to 150,000 records, and on average it will have 50,000 records. What are my best options? Please suggest.
I am thinking my first option is this: I can scan the whole table with a filter expression for records above a certain number of points (5000, for example). If that scan finds 50 or more records, simply sort the values and take the top 50. If the scan returns no results, or very few, reduce the filter value (to 3000, for example) and run the same scan again. If a filter value (2500, for example) returns too many records, like 5000 or more, increase it again. Is this even possible? I guess it would also need to handle pagination, roughly as in the sketch below. Is it advisable to scan a table which has 50,000 records?
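A minimal Python/boto3 sketch of that scan-and-filter idea (the names and thresholds are only illustrative):

import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Users")

def scan_above(points_threshold):
    # Scan the whole table for users above the threshold,
    # following LastEvaluatedKey to handle pagination.
    items = []
    kwargs = {"FilterExpression": Attr("Daily Points").gte(points_threshold)}
    while True:
        page = table.scan(**kwargs)
        items.extend(page["Items"])
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
    return items

matches = scan_above(5000)
# If fewer than 50 matched, retry with a lower threshold (3000, then 2500, ...)
top_50 = sorted(matches, key=lambda u: u["Daily Points"], reverse=True)[:50]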
Any advice or suggestion will be helpful. Thanks in advance.
Firstly, creating indexes for the above use case doesn't simplify the process, as indexes don't provide a solution for aggregation or sorting.
I would export the data to Hive and run the queries there, rather than writing code to determine the result, especially as this is a batch job executed only once per day.
Something like below:-
Create Hive table:-
CREATE EXTERNAL TABLE hive_users(userId string, dailyPoints bigint, timeSpendInTheApp bigint)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "Users",
"dynamodb.column.mapping" = "userId:UserID,dailyPoints:Daily_Points,timeSpendInTheApp:TimeSpendInTheApp");
Queries:-
SELECT dailyPoints, userId FROM hive_users ORDER BY dailyPoints DESC LIMIT 50;
SELECT timeSpendInTheApp, userId FROM hive_users ORDER BY timeSpendInTheApp DESC LIMIT 50;
Hive Reference

Access - Get the total of a column in a select query

I have an Access database set up that takes a bunch of raw data, splits things up in different 'select' queries and pipes the results into various CSV files, where a dashboard set up in Excel will pick it up.
There's some data that I'm trying to calculate in Access: namely, I have a quantity field, and I need to calculate the percentage of the total for each record. In other words, quantity / total of quantity.
Using my rather limited Access abilities, I tried the following query:
SELECT [Sales].*, [Quantity] / Sum([Quantity]) AS QuantityPercent FROM [Sales];
Which comes up with an error:
Your query does not include the specified expression 'company_name' as part of an aggregate function.
Company_name is the first field of the table, and after some Googling and Binging, I'm still quite confused as to what it means in this context.
To sum it up, my question is this: Is there a way to calculate data based off the total of a column/field?
The easy method is to use DSum:
SELECT
[Sales].*,
[Quantity] / DSum("[Quantity]", "[Sales]") AS QuantityPercent
FROM
[Sales];
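If you prefer to stay in plain SQL, a scalar subquery should be equivalent (a sketch against the same [Sales] table):

SELECT
[Sales].*,
[Quantity] / (SELECT Sum([Quantity]) FROM [Sales]) AS QuantityPercent
FROM
[Sales];

DSum and the subquery both compute the grand total once per query, so neither trips the aggregate-function error that mixing Sum() with [Sales].* caused.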

Aggregating data from Parse columns

Using Parse for the backend of my app. If I want to isolate data from a single column within a class (à la an Excel spreadsheet column) in order to get a numerical sum of the data, is there a way to export a single column as numbers?
Parse doesn't have any aggregation queries like SQL, so your only options are either to keep totals updated using afterSave handlers (a very common pattern), or to fetch all the rows and aggregate the data yourself (queries are limited to 100 rows by default, with 1000 as the max, so this option has issues).
To see an example of the afterSave pattern, look at this documentation:
https://parse.com/docs/cloud_code_guide#functions-aftersave
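For illustration, a minimal sketch of that afterSave pattern in Cloud Code; the Transaction class, the Totals row, and the amount field are all hypothetical:

// Hypothetical classes: keep a running total of Transaction.amount
// in a pre-created Totals row whenever a Transaction is saved.
Parse.Cloud.afterSave("Transaction", function(request) {
  var query = new Parse.Query("Totals");
  query.equalTo("name", "amountTotal");
  query.first().then(function(total) {
    total.increment("value", request.object.get("amount"));
    return total.save();
  });
});

Reading the sum is then a single fetch of the Totals row instead of paging through every record.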
