hive select count based on average of count - hadoop

I am trying to select the values whose count is at least the average count, using the Hive statement below:
Select x, Count(x) as y from data
group by x
Having Count(x) >= (select Avg(z.count1) as aveg
from (select x, Count(x) as count1 from data group by x ) z) ;
I receive this error: ParseException line 1:87 cannot recognize input near 'Select' 'Avg' '(' in expression specification

select x
,cnt_x
from (select x
,count(x) as cnt_x
,avg (count(x)) over () as avg_cnt_x
from data
group by x
) t
where cnt_x >= avg_cnt_x
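
To make the logic of the window-function query easier to check, here is a minimal Python sketch of the same computation: count each value, average those counts, and keep the values whose count is at least that average. It is an illustration only, not Hive code; the function name and the sample rows are made up, and it assumes the column contains no NULLs.

```python
from collections import Counter

def values_with_count_at_least_average(rows):
    """Return {value: count} for values whose count >= the average count."""
    counts = Counter(rows)                    # count(x) ... group by x
    if not counts:
        return {}
    avg = sum(counts.values()) / len(counts)  # avg(count(x)) over ()
    # where cnt_x >= avg_cnt_x
    return {x: c for x, c in counts.items() if c >= avg}

# Hypothetical sample data: counts are a=3, b=1, c=2, so the average is 2.
rows = ["a", "a", "a", "b", "c", "c"]
print(values_with_count_at_least_average(rows))
```

The average is taken over the per-group counts, not over the raw rows, which is what the `over ()` window on top of `group by` does in the query above.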

Related

Why maxMerge() gives different results than max() of sumMerge()?

The whole point is to get peaks per period (e.g. 5m peaks) for a value that accumulates. The value needs to be summed per period, and then the peak (maximum) can be found among those sums: (select max(v) from (select sum(v) from t group by a1, a2))
I have a base table t.
Data is inserted into t; consider two attributes (a time t1 and some string a2) and one numeric value v.
The value accumulates, so it needs to be summed to get the total volume over a certain period. Example of inserted rows:
t1 | a2 | v
----------------
date1 | b | 1
date2 | c | 20
I'm using an MV to compute sumState(), and from that I get the peaks using sumMerge() and then max().
I only need the max values, so I was wondering whether I could use maxState() directly.
This is what I do now: I use an MV that computes a 5m sum, and from that I read max():
CREATE TABLE IF NOT EXISTS sums_table ON CLUSTER '{cluster}' (
t1 DateTime,
a2 String,
v AggregateFunction(sum, UInt32)
)
ENGINE = ReplicatedAggregatingMergeTree(
'...',
'{replica}'
)
PARTITION BY toDate(t1)
ORDER BY (a2, t1)
PRIMARY KEY (a2);
CREATE MATERIALIZED VIEW IF NOT EXISTS mv_a
ON CLUSTER '{cluster}'
TO sums_table
AS
SELECT toStartOfFiveMinute(t1) AS t1, a2,
sumState(toUInt32(v)) AS v
FROM t
GROUP BY t1, a2
From that I'm able to read the max of the 5m sums for each a2 using:
SELECT
a2,
max(sum) AS max
FROM (
SELECT
t1,
a2,
sumMerge(v) AS sum
FROM sums_table
WHERE t1 BETWEEN :fromDateTime AND :toDateTime
GROUP BY t1, a2
)
GROUP BY a2
ORDER BY max DESC
That works perfectly.
So I wanted to achieve the same using maxState and maxMerge():
CREATE TABLE IF NOT EXISTS max_table ON CLUSTER '{cluster}' (
t1 DateTime,
a2 String,
max_v AggregateFunction(max, UInt32)
)
ENGINE = ReplicatedAggregatingMergeTree(
'...',
'{replica}'
)
PARTITION BY toDate(t1)
ORDER BY (a2, t1)
PRIMARY KEY (a2);
CREATE MATERIALIZED VIEW IF NOT EXISTS mv_b
ON CLUSTER '{cluster}'
TO max_table
AS
SELECT
t1,
a2,
maxState(v) AS max_v
FROM (
SELECT
toStartOfFiveMinute(t1) AS t1,
a2,
toUInt32(sum(v)) AS v
FROM t
GROUP BY t1, a2
)
GROUP BY t1, a2
I thought that if I get a max per time (t1) and a2, and then select the max of that per a2, I'd get the maximum value for each a2. But with the query below I get totally different max values compared to the max of sums mentioned above.
SELECT
a2,
max(max) AS max
FROM (
SELECT
t1,
a2,
maxMerge(max_v) AS max
FROM max_table
WHERE t1 BETWEEN :fromDateTime AND :toDateTime
GROUP BY t1, a2
) maxs_per_time_and_a2
GROUP BY a2
What did I do wrong? Do I misunderstand MVs? Is it possible to use maxState with maxMerge for 2+ attributes to compute the max over a longer period, let's say a year?
SELECT
t1,
a2,
maxState(v) AS max_v
FROM (
SELECT
toStartOfFiveMinute(t1) AS t1,
a2,
toUInt32(sum(v)) AS v
FROM t
GROUP BY t1, a2
)
GROUP BY t1, a2
This is incorrect, and it cannot work, because an MV is an insert trigger: it never reads the real table t.
The inner sum(v) only sees the rows of the block being inserted, so you are getting the max of partial sums, one per insert.
If you insert 1 row with v=10, you will get max_v = 10. The MV does not "know" that a previous insert has added rows for the same period, so their sum is not taken into account.
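
The difference can be reproduced outside the database. The following Python sketch is a simplified model, not ClickHouse code: each inner list stands for one insert block, and the function names and sample data are hypothetical. The first function models the sumState/sumMerge approach (partial sums from all inserts are merged before the max is taken); the second models the maxState view from the question (each insert block is summed on its own and only the largest partial sum survives).

```python
from collections import defaultdict

def max_of_merged_sums(inserts):
    """Model of sumState + sumMerge, then max(): sums accumulate across inserts."""
    sums = defaultdict(int)
    for block in inserts:
        for bucket, key, v in block:
            sums[(bucket, key)] += v
    peaks = defaultdict(int)
    for (_bucket, key), total in sums.items():
        peaks[key] = max(peaks[key], total)
    return dict(peaks)

def max_of_per_insert_sums(inserts):
    """Model of maxState(sum(v)) in the view: each insert block is summed alone."""
    peaks = defaultdict(int)
    for block in inserts:
        block_sums = defaultdict(int)
        for bucket, key, v in block:
            block_sums[(bucket, key)] += v
        for (_bucket, key), partial in block_sums.items():
            peaks[key] = max(peaks[key], partial)
    return dict(peaks)

# Two separate inserts that both touch the 12:00 bucket for a2 = "b".
inserts = [
    [("12:00", "b", 10)],
    [("12:00", "b", 5), ("12:05", "b", 7)],
]
print(max_of_merged_sums(inserts))      # {'b': 15}
print(max_of_per_insert_sums(inserts))  # {'b': 10}
```

The two results agree only when all rows of a period happen to arrive in a single insert block.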

Using Row Pivot Together

My raw query is:
select element1_name, element1_value, element2_name, element2_value
from reports r
I want to convert the element_name/element_value rows to columns, so I wrote this query:
select *
from (select *
from (select element1_name,
element1_value,
element2_name,
element2_value
from reports r)
pivot(max(element1_value) as one
for element1_name in('C' as C, 'Si' as SI, 'P' as P)))
pivot(max(element2_value) as tow
for element2_name in('C' as C, 'Si' as SI, 'P' as P))
Is there a way to write the two pivots together, without two subqueries, like this?
Select * (...) pivot element1,pivot element2
Question: how can I optimize this query?
You can try the query below:
select *
from (select element1_name,
element1_value,
element2_name,
element2_value
from reports) AS R
pivot(max(element1_value) for element1_name in('C' as C, 'Si' as SI, 'P' as P)) as PV1
pivot(max(element2_value) for element2_name in('C' as C, 'Si' as SI, 'P' as P)) as PV2;

Table of all combinations possible with combination id

Here comes a little ABAP challenge:
For an ABAP project, I must build, from an internal table with 2 columns (example1), another table containing all possible combinations (example2).
The "X" column represents the parameter. The "Y" column represents the parameter value.
example1:
X (param) | Y (value)
---------------------
A | a1
A | a2
A | a3
B | b1
B | b2
C | c1
C | c2
In the result table (example2):
We must get all combinations, each with a numeric id (3 columns in total).
The new "z" column represents the combination id. Each combination has as many lines as there are distinct parameters (in our case 3 lines, for A, B and C).
The "x" column still represents the parameter and the "y" column the associated value.
example2:
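z (combi num) | x (param) | y (value)
-------------------------------------
1 | A | a1
1 | B | b1
1 | C | c1
2 | A | a1
2 | B | b1
2 | C | c2
3 | A | a1
3 | B | b2
3 | C | c1
4 | A | a1
4 | B | b2
4 | C | c2
... | ... | ...
12 | A | a3
12 | B | b2
12 | C | c2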
Another remark: neither the number of parameters nor the number of values per parameter is fixed (the initial internal table can change a lot, and so can the possible combinations).
We may need recursion, but I'm not sure.
Here is a non-recursive way to do it; you might have to rewrite the parts that use the new 740 syntax. The idea is pretty simple. First, the LOOP transforms the data into an internal table with one entry per parameter, each entry holding a table of that parameter's possible values. From there, the WHILE loop goes through all the combinations and appends them to another internal table.
REPORT z_algorithm.
TYPES: ty_param TYPE char1,
ty_value TYPE char2,
BEGIN OF ty_struct,
x TYPE ty_param,
y TYPE ty_value,
END OF ty_struct,
BEGIN OF ty_combi,
z TYPE i,
s TYPE ty_struct,
END OF ty_combi.
TYPES: BEGIN OF ty_param_struct,
x TYPE ty_param,
ys TYPE STANDARD TABLE OF ty_value WITH DEFAULT KEY,
ix TYPE i,
END OF ty_param_struct.
DATA: tab TYPE STANDARD TABLE OF ty_struct,
params TYPE STANDARD TABLE OF ty_param_struct,
done TYPE abap_bool VALUE abap_false,
z TYPE i VALUE 0,
overflow TYPE abap_bool VALUE abap_false,
combis TYPE STANDARD TABLE OF ty_combi.
START-OF-SELECTION.
APPEND VALUE: #( x = 'A' y = 'a1' ) TO tab,
#( x = 'A' y = 'a2' ) TO tab,
#( x = 'A' y = 'a3' ) TO tab,
#( x = 'B' y = 'b1' ) TO tab,
#( x = 'B' y = 'b2' ) TO tab,
#( x = 'C' y = 'c1' ) TO tab,
#( x = 'C' y = 'c2' ) TO tab.
LOOP AT tab ASSIGNING FIELD-SYMBOL(<tab>).
READ TABLE params WITH KEY x = <tab>-x ASSIGNING FIELD-SYMBOL(<param>).
IF sy-subrc NE 0.
APPEND INITIAL LINE TO params ASSIGNING <param>.
<param>-x = <tab>-x.
<param>-ix = 1.
ENDIF.
APPEND <tab>-y TO <param>-ys.
ENDLOOP.
WHILE done EQ abap_false.
ADD 1 TO z.
overflow = abap_true.
done = abap_true.
LOOP AT params ASSIGNING <param>.
READ TABLE <param>-ys INDEX <param>-ix ASSIGNING FIELD-SYMBOL(<y>).
APPEND VALUE #( z = z s-x = <param>-x s-y = <y> ) TO combis.
IF overflow EQ abap_true.
ADD 1 TO <param>-ix.
ENDIF.
IF <param>-ix GT lines( <param>-ys ).
overflow = abap_true.
<param>-ix = 1.
ELSE.
overflow = abap_false.
done = abap_false.
ENDIF.
ENDLOOP.
ENDWHILE.
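
For readers without an ABAP system at hand, the same two steps (group the values per parameter, then enumerate every combination with a running id) can be sketched in Python. This is an illustration under stated assumptions, not a translation of the report above: it uses the standard library's itertools.product instead of explicit index counters, and the function name is made up. With itertools.product the last parameter varies fastest, which matches the ordering of example2 in the question; the ABAP loop above advances the first parameter's index first, so it yields the same 12 combinations in a different order.

```python
from itertools import product

def all_combinations(rows):
    """rows: iterable of (param, value) pairs. Returns (combo_id, param, value) tuples."""
    # Step 1: one entry per parameter, holding the list of its values.
    values_by_param = {}
    for param, value in rows:
        values_by_param.setdefault(param, []).append(value)

    # Step 2: enumerate the Cartesian product and number each combination.
    params = list(values_by_param)
    result = []
    for combo_id, combo in enumerate(product(*values_by_param.values()), start=1):
        for param, value in zip(params, combo):
            result.append((combo_id, param, value))
    return result

rows = [("A", "a1"), ("A", "a2"), ("A", "a3"),
        ("B", "b1"), ("B", "b2"),
        ("C", "c1"), ("C", "c2")]
for line in all_combinations(rows)[:6]:
    print(line)
```

Because the grouping step builds the value lists dynamically, the sketch does not depend on a fixed number of parameters or values.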

Why does this if statement only work at the first step

calculateSum(_, _List, _Row, _Col, []).
calculateSum([M|Rest],List,Row,Col,[Y|Tail]):-
Col == Row -> Col1 is Col + 1,calculateSum(List,List,Row1,Col1,Tail);
calcHeu(Rest,L),
sum(L,S),
index(List, Row, Col, V),
Y is V + S,
%Row1 is Row + 1,
Col1 is Col + 1,
calculateSum(List,List,Row1,Col1,Tail).
Why doesn't this Col == Row if statement work? Is there any other way to skip that step when Row == Col?
EDIT
I tried something like this:
(Col \= Row ->
calcHeu(Rest,L),
sum(L,S),
index(List, Row, Col, V),
Y is V + S,
Col1 is Col + 1,
calculateSum(List,List,Row1,Col1,Tail)
;
Col1 is Col + 1,calculateSum(List,List,Row1,Col1,Tail)
).
It prints out [22,_,_,_...... infinitely.
It's a bit hard to tell without knowing what your code is supposed to do and what inputs it gets, but you can definitely break your if-then-else statement into two rules with a cut (which in my opinion is preferable anyway: try to avoid using ";" as much as possible).
Try this (note that I've changed the first occurrence of "List" in the recursive call to "Rest", because I think that's what you want anyway):
calculateSum(_, _List, _Row, _Col, []).
calculateSum([_M|Rest],List,Row,Row,[_Y|Tail]):-
!,
Col1 is Row + 1, % Col is the same variable as Row in this clause head
calculateSum(Rest,List,Row1,Col1,Tail).
calculateSum([M|Rest],List,Row,Col,[Y|Tail]):-
calcHeu(Rest,L),
sum(L,S),
index(List, Row, Col, V),
Y is V + S,
Col1 is Col + 1,
calculateSum(Rest,List,Row1,Col1,Tail).

linq: what is the expression tree syntax for cross join

How do I write this LINQ query in expression tree syntax?
from x in 100.To(999)
from y in 100.To(999)
let product = x * y
where product.IsEven()
select product
The equivalent of 'from x from y select' is the SelectMany method used with an additional Select:
100.To(999).SelectMany(x => 100.To(999).Select(y => x * y))
.Where(x => x.IsEven())
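
SelectMany is a flat-map: it maps each x to a sequence and concatenates the resulting sequences. As a rough cross-language illustration, the same shape can be sketched in Python. The select_many helper below is hypothetical (it is not a LINQ or Python library function), and the sketch assumes that To() produces an inclusive range and that IsEven() tests divisibility by 2, since neither extension method is shown in the question.

```python
from itertools import chain

def select_many(source, selector):
    """Flat-map: apply selector to each item and concatenate the results."""
    return chain.from_iterable(selector(x) for x in source)

def even_products(lo, hi):
    rng = range(lo, hi + 1)  # assumption: To() includes both endpoints
    # from x in rng, from y in rng, select x * y
    products = select_many(rng, lambda x: (x * y for y in rng))
    # where product.IsEven()
    return [p for p in products if p % 2 == 0]

print(even_products(1, 3))
```

The filter runs on the flattened sequence of products, mirroring the position of Where after SelectMany in the C# answer.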

Resources