Split a range of numbers in equal parts in hive - hadoop

I have a table with 40 lakh (4 million) rows and a column 'playcount' whose minimum value is 1 and maximum value is around 17,000.
I would like to split this table into 15 groups by adding a column that will have values 1 to 15 based on the 'playcount' column.
Hive has a function NTILE which allows you to do something similar. If I do NTILE(15) OVER (ORDER BY playcount) AS mygroup, it does break the table up, but into buckets of equal row counts; since the lower values are far more common (more than 50% of rows have values less than 5), the grouping is such that values over 35 already get the maximum group value of 15.
I would like to do the grouping based on the range of playcount values, not on the count of rows.
Is something similar possible in Hive?
Thanks

One possibility I can think of is playcount%15 as mygroup.
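If the goal is 15 equal-width buckets over the value range rather than equal-count buckets, another option is to derive the bucket with plain arithmetic from the min and max. A rough sketch, assuming the table is called mytable (the table name and the cross-join subquery are assumptions, not from the post):

-- bucket width is (max - min + 1) / 15, so each group covers an equal slice of the playcount range
select t.*,
       cast(least(15, 1 + floor((t.playcount - m.min_pc) * 15 / (m.max_pc - m.min_pc + 1))) as int) as mygroup
from mytable t
cross join (select min(playcount) as min_pc, max(playcount) as max_pc from mytable) m;

Each bucket then covers roughly (17,000 - 1) / 15, or about 1,133 playcount values, regardless of how many rows fall into it.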

Related

Oracle counting distinct rows using two columns merged as one

I have one table where each row has three columns. The first two columns are a prefix and a value. The third column is what I'm trying to get a distinct count of, for each combination of columns one and two.
Basically I'm trying to get to this.
Account            Totals
prefix & value1    101
prefix & value2    102
prefix & value3    103
I've tried a lot of different versions but I'm basically noodling around this.
select prefix||value as Account, count(distinct thirdcolumn) as Totals from Transactions
It sounds like you want
SELECT
    prefix||value Account,
    count(distinct thirdcolumn) Totals
FROM Transactions
GROUP BY prefix, value
The count(distinct thirdcolumn) says you want a count of the distinct values in the third column. The GROUP BY prefix, value says you want a row returned for each unique prefix/value combination and that the count applies to that combination.
Note that "thirdcolumn" is a placeholder for the name of your third column, not a magic keyword, since I didn't see the actual name in the post.
If you want the number of rows for each prefix/value pair then you can use:
SELECT prefix || value AS account,
       COUNT(*) AS totals
FROM Transactions
GROUP BY prefix, value
You do not want to count the DISTINCT values for prefix/value: if you GROUP BY those values, then different values for the pairs will be in different groups, so the COUNT of DISTINCT prefix/value pairs would always be one.
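To see why, consider this hypothetical query (illustrative only; it is not something you would actually want to run):

-- every group holds exactly one distinct prefix/value pair,
-- so this count comes back as 1 on every row
SELECT prefix || value AS account,
       COUNT(DISTINCT prefix || value) AS always_one
FROM Transactions
GROUP BY prefix, value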

How to find how many times the values in one column are greater than the values in another column?

I have two columns and I need to find, in DAX, how many times the values in one column are greater than the values in the other column.
For example:
Column_A   Column_B
8.07       0.70
I want to know: how many times is Column_A (8.07) greater than Column_B (0.70)?
MyMeasure = COUNTROWS(
    FILTER(MyTable,
        MyTable[Column_A] > MyTable[Column_B]))

DAX - Measure that sums only the first occurrence by group

I'm trying to figure out how to build a measure that sums a total, but takes only the first non-blank row for each user.
For example, my data looks like the below:
date    user  value
-------------------
1/1/17  a     15
2/1/17  a     12
1/1/17  b     null
5/1/17  b     3
I'd therefore like a result of 18 (15 + 3).
I'm thinking that using FIRSTNONBLANK would help, but it only takes a single column, and I'm not sure how to give it the grouping - perhaps some sort of windowing is required.
I've tried the below, but am struggling to work out what the correct syntax is
groupby(
    GROUPBY(
        myTable,
        myTable[user],
        "Total", SUMX(CURRENTGROUP(), FIRSTNONBLANK([value], 1))
    ),
    sum([total])
)
I didn't have much luck getting FIRSTNONBLANK or GROUPBY to work exactly how I wanted, but I think I found something that works:
SUMX(
    ADDCOLUMNS(
        ADDCOLUMNS(
            VALUES(myTable[User]),
            "FirstDate",
            CALCULATE(MIN(myTable[Date]),
                      NOT(ISBLANK(myTable[Value])))),
        "FirstValue",
        CALCULATE(SUM(myTable[Value]),
                  FILTER(myTable, myTable[Date] = [FirstDate]))),
    [FirstValue])
The inner ADDCOLUMNS calculates the first non-blank date value for each user in the filter context.
The next ADDCOLUMNS takes that table of users and first dates and, for each user, sums the [value] entries that occurred on that user's first date.
The outer SUMX takes that resulting table and totals all of the values of [FirstValue].

In sqoop what does "size" mean when used with --split-limit arguments

From the sqoop docs:
Using the --split-limit parameter places a limit on the size of the split section created. If the size of the split created is larger than the size specified in this parameter, then the splits would be resized to fit within this limit, and the number of splits will change according to that.
What does "size" refer to here? Can someone explain with a little example?
I was just reading this and I think it would be interpreted like this.
Say the example table has a primary key column called ID which is an INT, and the table has 1000 rows with ID values from 1 to 1000. If you set num-mappers to 50 then you would have 50 tasks, each trying to import 20 rows. The first query would have a predicate that says WHERE ID >= 1 AND ID <= 20, the second mapper would say WHERE ID >= 21 AND ID <= 40, and so on.
If you also define the split-limit then depending on the size of the splits this parameter may adjust the number of tasks used to sqoop the data.
For example, with num-mappers set to 50 and split-limit set to 10, you would now need 100 tasks, each importing 10 rows of data, to get all 1000 rows. Your first task would now say WHERE ID >= 1 AND ID <= 10.
In the case of a DateTime column, the value is now based on seconds. So if you have 10 years of data with 1 row for every day, you would have about 3,653 rows of data. If you set num-mappers to 10 then your tasks would each try to sqoop about 365 days of data, with a predicate that looked something like MYDATETIMECOL >= '2010-01-01' AND MYDATETIMECOL <= '2010-12-31'. But if you also set the split-limit to something like 2592000 (the number of seconds in 30 days) then you would need about 122 tasks to sqoop the data, and the first task would have a predicate like MYDATETIMECOL >= '2010-01-01' AND MYDATETIMECOL <= '2010-01-30'.
These two examples have both used a 1:1 ratio for column value to row count. If either of these tables had 1000 rows per value in the split-by col then ALL of those rows would be sqooped as well.
For example, with the DateTime column, if you had loaded 1000 rows every day for the last 10 years you would now have 3,653,000 rows; the predicates and the number of tasks would be the same, but the number of rows sqooped in each of those tasks would be 1000x more.
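To make the DateTime example concrete, here is a rough sketch of what the import command might look like (the connection string and table name are assumptions, not from the question):

sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --table my_table \
  --split-by MYDATETIMECOL \
  --num-mappers 10 \
  --split-limit 2592000

num-mappers asks for 10 splits of roughly a year each, but split-limit caps each split at 2,592,000 seconds (about 30 days), so sqoop ends up generating roughly 122 splits instead.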

Stacked column Flash chart counting all values

I am building a stacked column flash chart on my query. I would like to split the values in a column for different locations. For argument's sake I have 5 ids in location 41, 3 ids in location 21, and 8 ids in location 1.
select
    '' link,
    To_Char(ENQUIRED_DATE,'MON-YY') label,
    count(decode(location_id,41,id,0)) "location1",
    count(decode(location_id,21,id,0)) "location2",
    count(decode(location_id,1,id,0)) "location3"
from "my_table"
where some_conditions = 'Y';
As a result of this query Apex is creating a stacked column with three separate parts (hurray!); however, instead of having values 5, 3 and 8, it returns three regions of 16, 16, 16 (16 = 5 + 3 + 8).
So obviously Apex is going through all the decode conditions and adding up all the values.
I am trying to achieve something described in this article.
Apex doesn't appear to be doing anything funky; you'd get the same result running that query through SQL*Plus. When you do:
count(decode(location_id,41,id,0)) "location1",
.. then the count gets incremented for every row: count() counts every non-null value, and the decode returns the fixed value 0 (which is not null) for the rows that don't match, so every row is counted regardless of location_id. I think you meant to use sum:
sum(decode(location_id,41,1,0)) "location1",
Here each row is assigned either zero or one, and summing those gives you the number that got one, which is the number that had the specified id value.
Personally I'd generally use case over decode, but the result is the same:
sum(case when location_id = 41 then 1 else 0 end) "location1",
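Putting that together, a sketch of the full corrected query (the GROUP BY on the month label is my addition, since the chart needs one row per label; the rest is taken from the question):

select '' link,
       To_Char(ENQUIRED_DATE,'MON-YY') label,
       sum(case when location_id = 41 then 1 else 0 end) "location1",
       sum(case when location_id = 21 then 1 else 0 end) "location2",
       sum(case when location_id = 1 then 1 else 0 end) "location3"
from "my_table"
where some_conditions = 'Y'
group by To_Char(ENQUIRED_DATE,'MON-YY');

With the sample data this returns 5, 3 and 8 for the three series instead of 16, 16, 16.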
