Hive - same sequence of records in array

Hive - same sequence of records in array - hadoop

I have a table with data at hour level. I want to find the count of hours and the values for col1 and col2 for all hours in an array. Input Table
+-----+-----+-----+
| hour| col1| col2|
+-----+-----+-----+
| 00 | 0.0 | a |
| 04 | 0.1 | b |
| 08 | 0.2 | c |
| 12 | 0.0 | d |
+-----+-----+-----+
I am using the below query to get the column values in an array
Query:
select count(hr), map_values(str_to_map(concat_ws(',',collect_set(concat_ws(':',reflect('java.util.UUID','randomUUID'),cast(col1 as string)))))) as col1_arr, map_values(str_to_map(concat_ws(',',collect_set(concat_ws(':',reflect('java.util.UUID','randomUUID'),cast(col2 as string)))))) as col2_arr from table;
Output that i am getting, values in col2_arr are not in the same sequence with col1_arr. Please suggest how can i get the values in array/list for different columns in same sequence.
+----------+-----------------+----------+
| count(hr)| col1_arr | col2_arr |
+----------+-----------------+----------+
| 4 | 0.0,0.1,0.2,0.0 | b,a,c,d |
+----------+----------------+-----------+
Required output:
+----------+-----------------+----------+
| count(hr)| col1_arr | col2_arr |
+----------+-----------------+----------+
| 4 | 0.0,0.1,0.2,0.0 | a,b,c,d |
+----------+----------------+-----------+
Thanks

select count(*) as cnt
,concat_ws(',',sort_array(collect_list(hour))) as hour
,regexp_replace(concat_ws(',',sort_array(collect_list(concat_ws(':',hour,cast(col1 as string))))),'..:','') as col1
,regexp_replace(concat_ws(',',sort_array(collect_list(concat_ws(':',hour,col2)))),'..:','') as col2
from mytable
;
+-----+-------------+-------------+---------+
| cnt | hour | col1 | col2 |
+-----+-------------+-------------+---------+
| 4 | 00,04,08,12 | 0,0.1,0.2,0 | a,b,c,d |
+-----+-------------+-------------+---------+

Related

How to categorize to age ranges and find the top 3?

Having a hive table with age column consisting of age of persons.
Have to count and display the top 3 age categories.
Ex: whether below 10, 10-15, 15-20, 20-25, 25-30, ...
Which age category appears more.
Please suggest me a query to do this.

select case
when age <= 10 then '0-10'
else concat_ws
(
'-'
,cast(floor(age/5)*5 as string)
,cast((floor(age/5)+1)*5 as string)
)
end as age_group
,count(*) as cnt
from mytable
group by 1
order by cnt desc
limit 3
;
You might need to set this parameter:
set hive.groupby.orderby.position.alias=true;
Demo
with mytable as
(
select floor(rand()*100) as age
from (select 1) x lateral view explode(split(space(100),' ')) pe
)
select case
when age <= 10 then '0-10'
else concat_ws('-',cast(floor(age/5)*5 as string),cast((floor(age/5)+1)*5 as string))
end as age_group
,count(*) as cnt
,sort_array(collect_list(age)) as age_list
from mytable
group by 1
order by cnt desc
;
+-----------+-----+------------------------------+
| age_group | cnt | age_list |
+-----------+-----+------------------------------+
| 0-10 | 9 | [0,0,1,3,3,6,8,9,10] |
| 25-30 | 9 | [26,26,28,28,28,28,29,29,29] |
| 55-60 | 8 | [55,55,56,57,57,57,58,58] |
| 35-40 | 7 | [35,35,36,36,37,38,39] |
| 80-85 | 7 | [80,80,81,82,82,82,84] |
| 30-35 | 6 | [31,32,32,32,33,34] |
| 70-75 | 6 | [70,70,71,71,72,73] |
| 65-70 | 6 | [65,67,67,68,68,69] |
| 50-55 | 6 | [51,53,53,53,53,54] |
| 45-50 | 5 | [45,45,48,48,49] |
| 85-90 | 5 | [85,86,87,87,89] |
| 75-80 | 5 | [76,77,78,79,79] |
| 20-25 | 5 | [20,20,21,22,22] |
| 15-20 | 5 | [17,17,17,18,19] |
| 10-15 | 4 | [11,12,12,14] |
| 95-100 | 4 | [95,95,96,99] |
| 40-45 | 3 | [41,44,44] |
| 90-95 | 1 | [93] |
+-----------+-----+------------------------------+

Auto increment without sequence

I have a select statement that generate set value thereafter I want insert that set of values into another table, MY concern is I'm using select statement in select I'm using one one more select clause((select max(org_id)+1 from org)) where I'm trying to get max value and increment by one but I'm not able get incremented value instead I'm getting same value you can see column name id_limit
select abc,abc1,abc3,abc4,(select max(org_id)+1 from org) as id_limit from xyz
current output
-----------------------------------------------------------------
| abc | abc1 | abc3 | abc4 | id_limit |
----------------------------------------------------------------|
| BUSINESS_UNIT | 0 | 100 | London | 6 |
| BUSINESS_UNIT | 0 | 200 | Sydney | 6 |
| BUSINESS_UNIT | 0 | 300 | Kiev | 6 |
-----------------------------------------------------------------
I'm trying to get expected out output
-----------------------------------------------------------------
| abc | abc1 | abc3 | abc4 | id_limit |
----------------------------------------------------------------|
| BUSINESS_UNIT | 0 | 100 | London | 6 |
| BUSINESS_UNIT | 0 | 200 | Sydney | 7 |
| BUSINESS_UNIT | 0 | 300 | Kiev | 8 |
-----------------------------------------------------------------

Yes, in Oracle 12.
create table foo (
id number generated by default on null as identity
);
https://oracle-base.com/articles/12c/identity-columns-in-oracle-12cr1
In previous versions you use sequence/trigger as explained here:
How to create id with AUTO_INCREMENT on Oracle?

Oracle query execution plan

I am little confused over the execution plan of an Oracle query. This is in Oracle Enterprise Edition 11.2.0.1.0 on platform IBM AIX 6.1. I have a table TEST1 (1 million rows) and another table TEST2 (50,000 rows). Both tables have identical columns. There is a view created as a union of these 2 tables. I am firing a query on this view, with an indexed column in the WHERE clause. What I could find is that the index is not used and a full table scan is resulted. With a slight modification of the query, it started using the index. I am wondering how this particular change can result in the plan change.
Please find the complete DDL + DML below. I have given simplified example. Actual schema and requirements are bit more complex. In fact the query in question is dynamically constructed and executed by an OCI code generator. My intention here is not to get alternatives, but to really understand what could be the logical reasoning behind the plan change (between, I am an application programmer and not a database administrator). Your help is much appreciated.
DROP TABLE TEST1 CASCADE CONSTRAINTS ;
DROP TABLE TEST2 CASCADE CONSTRAINTS ;
CREATE TABLE TEST1
(
ID NUMBER(20) NOT NULL,
NAME VARCHAR2(40),
DAY NUMBER(20)
)
PARTITION BY RANGE (DAY)
(
PARTITION P001 VALUES LESS THAN (2),
PARTITION P002 VALUES LESS THAN (3),
PARTITION P003 VALUES LESS THAN (4),
PARTITION P004 VALUES LESS THAN (5),
PARTITION P005 VALUES LESS THAN (6),
PARTITION P006 VALUES LESS THAN (7),
PARTITION P007 VALUES LESS THAN (8),
PARTITION P008 VALUES LESS THAN (9),
PARTITION P009 VALUES LESS THAN (10),
PARTITION P010 VALUES LESS THAN (11),
PARTITION P011 VALUES LESS THAN (12),
PARTITION P012 VALUES LESS THAN (13),
PARTITION P013 VALUES LESS THAN (14),
PARTITION P014 VALUES LESS THAN (15),
PARTITION P015 VALUES LESS THAN (16),
PARTITION P016 VALUES LESS THAN (17),
PARTITION P017 VALUES LESS THAN (18),
PARTITION P018 VALUES LESS THAN (19),
PARTITION P019 VALUES LESS THAN (20),
PARTITION P020 VALUES LESS THAN (21),
PARTITION P021 VALUES LESS THAN (22),
PARTITION P022 VALUES LESS THAN (23),
PARTITION P023 VALUES LESS THAN (24),
PARTITION P024 VALUES LESS THAN (25),
PARTITION P025 VALUES LESS THAN (26),
PARTITION P026 VALUES LESS THAN (27),
PARTITION P027 VALUES LESS THAN (28),
PARTITION P028 VALUES LESS THAN (29),
PARTITION P029 VALUES LESS THAN (30),
PARTITION P030 VALUES LESS THAN (31)
) ;
CREATE INDEX IX_ID on TEST1 (ID) INITRANS 4 STORAGE(FREELISTS 16) LOCAL
(
PARTITION P001,
PARTITION P002,
PARTITION P003,
PARTITION P004,
PARTITION P005,
PARTITION P006,
PARTITION P007,
PARTITION P008,
PARTITION P009,
PARTITION P010,
PARTITION P011,
PARTITION P012,
PARTITION P013,
PARTITION P014,
PARTITION P015,
PARTITION P016,
PARTITION P017,
PARTITION P018,
PARTITION P019,
PARTITION P020,
PARTITION P021,
PARTITION P022,
PARTITION P023,
PARTITION P024,
PARTITION P025,
PARTITION P026,
PARTITION P027,
PARTITION P028,
PARTITION P029,
PARTITION P030
) ;
CREATE TABLE TEST2
(
ID NUMBER(20) PRIMARY KEY NOT NULL,
NAME VARCHAR2(40),
DAY NUMBER(20)
) ;
CREATE OR REPLACE VIEW TEST_V AS
SELECT
ID, NAME, DAY
FROM
TEST1
UNION
SELECT
ID, NAME, DAY
FROM
TEST2 ;
begin
for count in 1..1000000
loop
insert into test1 values(count, 'John', mod(count, 30) + 1) ;
end loop ;
end ;
/
begin
for count in 1000000..1050000
loop
insert into test2 values(count, 'Mary', mod(count, 30) + 1) ;
end loop ;
end ;
/
commit ;
set lines 300 ;
set pages 1000 ;
-- Actual query
explain plan for
SELECT Key FROM
(
WITH recs AS
(
SELECT * FROM TEST_V WHERE ID = 70000
)
(
SELECT 1 AS Key FROM recs WHERE NAME = 'John'
)
UNION
(
SELECT 2 AS Key FROM recs WHERE NAME = 'Mary'
)
) ;
select * from table(dbms_xplan.display()) ;
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | TempSpc | Cost (%CPU) | Time | Pstart | Pstop |
------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1611K | 4721K | | 13559 (1) | 00:02:43 | | |
| 1 | VIEW | | 1611K | 4721K | | 13559 (1) | 00:02:43 | | |
| 2 | TEMP TABLE TRANSFORMATION | | | | | | | | |
| 3 | LOAD AS SELECT | SYS_TEMP_0FD9D6610_34D3B6C | | | | | | | |
|* 4 | VIEW | TEST_V | 805K | 36M | | 10403 (1) | 00:02:05 | | |
| 5 | SORT UNIQUE | | 805K | 36M | 46M | 10403 (8) | 00:02:05 | | |
| 6 | UNION-ALL | | | | | | | | |
| 7 | PARTITION RANGE ALL | | 752K | 34M | | 721 (1) | 00:00:09 | 1 | 30 |
| 8 | TABLE ACCESS FULL | TEST1 | 752K | 34M | | 721 (1) | 00:00:09 | 1 | 30 |
| 9 | TABLE ACCESS FULL | TEST2 | 53262 | 2496K | | 68 (0) | 00:00:01 | | |
| 10 | SORT UNIQUE | | 1611K | 33M | 43M | 13559 (51) | 00:02:43 | | |
| 11 | UNION-ALL | | | | | | | | |
|* 12 | VIEW | | 805K | 16M | | 1429 (1) | 00:00:18 | | |
| 13 | TABLE ACCESS FULL | SYS_TEMP_0FD9D6610_34D3B6C | 805K | 36M | | 1429 (1) | 00:00:18 | | |
|* 14 | VIEW | | 805K | 16M | | 1429 (1) | 00:00:18 | | |
| 15 | TABLE ACCESS FULL | SYS_TEMP_0FD9D6610_34D3B6C | 805K | 36M | | 1429 (1) | 00:00:18 | | |
------------------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - filter("ID"=70000)
12 - filter("NAME"='John')
14 - filter("NAME"='Mary')
-- Modified query (only change is absence of outermost SELECT)
explain plan for
WITH recs AS
(
SELECT * FROM TEST_V WHERE ID = 70000
)
(
SELECT 1 AS Key FROM recs WHERE NAME = 'John'
)
UNION
(
SELECT 2 AS Key FROM recs WHERE NAME = 'Mary'
) ;
select * from table(dbms_xplan.display()) ;
PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU) | Time | Pstart | Pstop |
-----------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 4 | 88 | 6 (67) | 00:00:01 | | |
| 1 | TEMP TABLE TRANSFORMATION | | | | | | | |
| 2 | LOAD AS SELECT | SYS_TEMP_0FD9D6611_34D3B6C | | | | | | |
| 3 | VIEW | TEST_V | 2 | 96 | 4 (50) | 00:00:01 | | |
| 4 | SORT UNIQUE | | 2 | 96 | 4 (75) | 00:00:01 | | |
| 5 | UNION-ALL | | | | | | | |
| 6 | PARTITION RANGE ALL | | 1 | 48 | 1 (0) | 00:00:01 | 1 | 30 |
| 7 | TABLE ACCESS BY LOCAL INDEX ROWID | TEST1 | 1 | 48 | 1 (0) | 00:00:01 | 1 | 30 |
|* 8 | INDEX RANGE SCAN | IX_ID | 1 | | 1 (0) | 00:00:01 | 1 | 30 |
| 9 | TABLE ACCESS BY INDEX ROWID | TEST2 | 1 | 48 | 1 (0) | 00:00:01 | | |
|* 10 | INDEX UNIQUE SCAN | SYS_C001242692 | 1 | | 1 (0) | 00:00:01 | | |
| 11 | SORT UNIQUE | | 4 | 88 | 6 (67) | 00:00:01 | | |
| 12 | UNION-ALL | | | | | | | |
|* 13 | VIEW | | 2 | 44 | 2 (0) | 00:00:01 | | |
| 14 | TABLE ACCESS FULL | SYS_TEMP_0FD9D6611_34D3B6C | 2 | 96 | 2 (0) | 00:00:01 | | |
|* 15 | VIEW | | 2 | 44 | 2 (0) | 00:00:01 | | |
| 16 | TABLE ACCESS FULL | SYS_TEMP_0FD9D6611_34D3B6C | 2 | 96 | 2 (0) | 00:00:01 | | |
-----------------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
8 - access("ID"=70000)
10 - access("ID"=70000)
13 - filter("NAME"='John')
15 - filter("NAME"='Mary')
quit ;
thanks & regards,
Reji

I can not reproduce this in 11.2.0.3, I don't think there is a logical explanation for this behavior other than: you hit a bug, that apparently is solved in 11.2.0.3.
One thing that jumped immediately in my eye is the lack of object statistics and - if your output was complete - the fact that OPTIMIZER_DYNAMIC_SAMPLING is set to 0. You could try to reproduce with OPTIMIZER_DYNAMIC_SAMPLING=2. In that case the dynamic sampler kicks in if the object statistics are missing. BTW: don't use this feature instead of correct optimizer statistics. More info about dynamic sampling Dynamic sampling and its impact on the Optimizer
In your - nice documented - question and script/test case you try to make use of append and nologging. This only works for bulk inserts, not for row inserts with values. What would happen is for every insert: push-up the highwater mark and dump a full block of data in the free block, in your case that would have only 1 row .... Luckily, the database ignores this instruction.
Before you fire SQL to a table, make sure that you give it optimizer statistics. This will certainly help your case.

80% Rule Estimation Value in PL/SQL

Assume a range of values inserted in a schema table and in the end of the month i want to apply for these records (i.e. 2500 rows = numeric values) the algorithm: sort the values descending (from the smallest to highest value) and then find the 80% value of the sorted column.
In my example, if each row increases by one starting from 1, the 80% value will be the 2000 row=value (=2500-2500*20/100). This algorithm needs to be implemented in a procedure where the number of rows is not constant, for example it can varries from 2500 to 1,000,000 per month

Hint: You can achieve this using Oracle's cumulative aggregate functions. For example, suppose your table looks like this:
MY_TABLE
+-----+----------+
| ID | QUANTITY |
+-----+----------+
| A | 1 |
| B | 2 |
| C | 3 |
| D | 4 |
| E | 5 |
| F | 6 |
| G | 7 |
| H | 8 |
| I | 9 |
| J | 10 |
+-----+----------+
At each row, you can sum the quantities so far using this:
SELECT
id,
quantity,
SUM(quantity)
OVER (ORDER BY quantity ROWS UNBOUNDED PRECEDING)
AS cumulative_quantity_so_far
FROM
MY_TABLE
Giving you:
+-----+----------+----------------------------+
| ID | QUANTITY | CUMULATIVE_QUANTITY_SO_FAR |
+-----+----------+----------------------------+
| A | 1 | 1 |
| B | 2 | 3 |
| C | 3 | 6 |
| D | 4 | 10 |
| E | 5 | 15 |
| F | 6 | 21 |
| G | 7 | 28 |
| H | 8 | 36 |
| I | 9 | 45 |
| J | 10 | 55 |
+-----+----------+----------------------------+
Hopefully this will help in your work.

Write a query using the percentile_disc function to solve your problem. Sounds like it does what you want.
An example would be
select percentile_disc(0.8) within group (order by the_value)
from my_table

MySQL equivalent of ORACLES rank()

Oracle has 2 functions - rank() and dense_rank() - which i've found very useful for some applications. I am doing something in mysql now and was wondering if they have something equivalent to those?

Nothing directly equivalent, but you can fake it with some (not terribly efficient) self-joins. Some sample code from a collection of MySQL query howtos:
SELECT v1.name, v1.votes, COUNT(v2.votes) AS Rank
FROM votes v1
JOIN votes v2 ON v1.votes < v2.votes OR (v1.votes=v2.votes and v1.name = v2.name)
GROUP BY v1.name, v1.votes
ORDER BY v1.votes DESC, v1.name DESC;
+-------+-------+------+
| name | votes | Rank |
+-------+-------+------+
| Green | 50 | 1 |
| Black | 40 | 2 |
| White | 20 | 3 |
| Brown | 20 | 3 |
| Jones | 15 | 5 |
| Smith | 10 | 6 |
+-------+-------+------+

how about this "dense_rank implement" in MySQL
CREATE TABLE `person` (
`id` int(11) DEFAULT NULL,
`first_name` varchar(20) DEFAULT NULL,
`age` int(11) DEFAULT NULL,
`gender` char(1) DEFAULT NULL);
INSERT INTO `person` VALUES
(1,'Bob',25,'M'),
(2,'Jane',20,'F'),
(3,'Jack',30,'M'),
(4,'Bill',32,'M'),
(5,'Nick',22,'M'),
(6,'Kathy',18,'F'),
(7,'Steve',36,'M'),
(8,'Anne',25,'F'),
(9,'Mike',25,'M');
the data before dense_rank() like this
mysql> select * from person;
+------+------------+------+--------+
| id | first_name | age | gender |
+------+------------+------+--------+
| 1 | Bob | 25 | M |
| 2 | Jane | 20 | F |
| 3 | Jack | 30 | M |
| 4 | Bill | 32 | M |
| 5 | Nick | 22 | M |
| 6 | Kathy | 18 | F |
| 7 | Steve | 36 | M |
| 8 | Anne | 25 | F |
| 9 | Mike | 25 | M |
+------+------------+------+--------+
9 rows in set (0.00 sec)
the data after dense_rank() like this,including "partition by" function
+------------+--------+------+------+
| first_name | gender | age | rank |
+------------+--------+------+------+
| Anne | F | 25 | 1 |
| Jane | F | 20 | 2 |
| Kathy | F | 18 | 3 |
| Steve | M | 36 | 1 |
| Bill | M | 32 | 2 |
| Jack | M | 30 | 3 |
| Mike | M | 25 | 4 |
| Bob | M | 25 | 4 |
| Nick | M | 22 | 6 |
+------------+--------+------+------+
9 rows in set (0.00 sec)
the query statement is
select first_name,t1.gender,age,FIND_IN_SET(age,t1.age_set) as rank from person t2,
(select gender,group_concat(age order by age desc) as age_set from person group by gender) t1
where t1.gender=t2.gender
order by t1.gender,rank

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Hive - same sequence of records in array - hadoop

Related

How to categorize to age ranges and find the top 3?

Auto increment without sequence

Oracle query execution plan

80% Rule Estimation Value in PL/SQL

MySQL equivalent of ORACLES rank()

Categories

Resources