In HiveQL, what is the most elegant/performant way of calculating an average value if some of the data is implicitly not present?

In HiveQL, what is the most elegant and performant way of calculating an average value when there are 'gaps' in the data, with implicit repeated values between them? For example, consider a table with the following data:
+----------+----------+---------+
| Employee | Date     | Balance |
+----------+----------+---------+
| John     | 20181029 | 1800.2  |
| John     | 20181105 | 2937.74 |
| John     | 20181106 | 3000    |
| John     | 20181110 | 1500    |
| John     | 20181119 | -755.5  |
| John     | 20181120 | -800    |
| John     | 20181121 | 1200    |
| John     | 20181122 | -400    |
| John     | 20181123 | -900    |
| John     | 20181202 | -1300   |
+----------+----------+---------+
If I try to calculate a simple average of the November rows, it will return ~722.78, but the average should take into account that the days that are not shown have the same balance as the previous recorded row. In the above data, John had a balance of 1800.2 between 20181101 and 20181104, for example.
Assuming that the table always has exactly one row for each date/balance, and given that I cannot change how this data is stored (and probably shouldn't, since it would be a waste of storage to write rows for days with unchanged balances), I've been tinkering with getting the average from a select with subqueries for all the days in the queried month, returning NULL for the absent days, and then using case to get the balance from the previous available date in reverse order. All of this just to avoid writing temporary tables.

Step 1: Original Data
The 1st step is to recreate a table with the original data. Let's say the original table is called daily_employee_balance.
daily_employee_balance
use default;
drop table if exists daily_employee_balance;
create table if not exists daily_employee_balance (
employee_id string,
employee string,
iso_date date,
balance double
);
Insert sample data into the original table daily_employee_balance:
insert into table daily_employee_balance values
('103','John','2018-10-25',1800.2),
('103','John','2018-10-29',1125.7),
('103','John','2018-11-05',2937.74),
('103','John','2018-11-06',3000),
('103','John','2018-11-10',1500),
('103','John','2018-11-19',-755.5),
('103','John','2018-11-20',-800),
('103','John','2018-11-21',1200),
('103','John','2018-11-22',-400),
('103','John','2018-11-23',-900),
('103','John','2018-12-02',-1300);
Step 2: Dimension Table
You will need a dimension table containing a calendar (a table with all possible dates); call it dimension_date. Having a calendar table is standard industry practice, and you can probably download sample data for one from the internet.
use default;
drop table if exists dimension_date;
create external table dimension_date(
date_id int,
iso_date string,
year string,
month string,
month_desc string,
end_of_month_flg string
);
Insert some sample data for the entire month of Nov 2018:
insert into table dimension_date values
(6880,'2018-11-01','2018','2018-11','November','N'),
(6881,'2018-11-02','2018','2018-11','November','N'),
(6882,'2018-11-03','2018','2018-11','November','N'),
(6883,'2018-11-04','2018','2018-11','November','N'),
(6884,'2018-11-05','2018','2018-11','November','N'),
(6885,'2018-11-06','2018','2018-11','November','N'),
(6886,'2018-11-07','2018','2018-11','November','N'),
(6887,'2018-11-08','2018','2018-11','November','N'),
(6888,'2018-11-09','2018','2018-11','November','N'),
(6889,'2018-11-10','2018','2018-11','November','N'),
(6890,'2018-11-11','2018','2018-11','November','N'),
(6891,'2018-11-12','2018','2018-11','November','N'),
(6892,'2018-11-13','2018','2018-11','November','N'),
(6893,'2018-11-14','2018','2018-11','November','N'),
(6894,'2018-11-15','2018','2018-11','November','N'),
(6895,'2018-11-16','2018','2018-11','November','N'),
(6896,'2018-11-17','2018','2018-11','November','N'),
(6897,'2018-11-18','2018','2018-11','November','N'),
(6898,'2018-11-19','2018','2018-11','November','N'),
(6899,'2018-11-20','2018','2018-11','November','N'),
(6900,'2018-11-21','2018','2018-11','November','N'),
(6901,'2018-11-22','2018','2018-11','November','N'),
(6902,'2018-11-23','2018','2018-11','November','N'),
(6903,'2018-11-24','2018','2018-11','November','N'),
(6904,'2018-11-25','2018','2018-11','November','N'),
(6905,'2018-11-26','2018','2018-11','November','N'),
(6906,'2018-11-27','2018','2018-11','November','N'),
(6907,'2018-11-28','2018','2018-11','November','N'),
(6908,'2018-11-29','2018','2018-11','November','N'),
(6909,'2018-11-30','2018','2018-11','November','Y');
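If you would rather not type the calendar rows by hand, a short Hive query can generate them. This is only a sketch: it assumes the space(), split(), posexplode() and date_add() built-ins are available in your Hive version, and it borrows daily_employee_balance as a convenient one-row source.
--generate the 30 days of November 2018 instead of hard-coding them
--date_id 6880 matches the first id used in the manual insert above
insert into table dimension_date
select
  6880 + d.pos                               as date_id,
  date_add('2018-11-01', d.pos)              as iso_date,
  '2018'                                     as year,
  '2018-11'                                  as month,
  'November'                                 as month_desc,
  case when d.pos = 29 then 'Y' else 'N' end as end_of_month_flg
from (select split(space(29), ' ') as arr from daily_employee_balance limit 1) t
lateral view posexplode(t.arr) d as pos, val;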
Step 3: Fact Table
Create a fact table from the original table. In normal practice, you ingest the data into HDFS/Hive, then process the raw data and create a table of historical data into which you keep inserting incrementally. You can look into data warehousing to get the proper definition, but I call this a fact table: f_employee_balance.
This will re-create the original table with the missing dates filled in and the missing balances populated with the earlier known balance.
--inner query to get all the possible dates
--outer self join query will populate the missing dates and balance
drop table if exists f_employee_balance;
create table f_employee_balance
stored as orc tblproperties ("orc.compress"="SNAPPY") as
select
  q1.employee_id,
  q1.iso_date,
  nvl(last_value(r.balance, true) --initial dates to be populated with 0 balance
      over (partition by q1.employee_id
            order by q1.iso_date
            rows between unbounded preceding and current row), 0) as balance,
  q1.month,
  q1.year
from (
  select distinct
    r.employee_id,
    d.iso_date as iso_date,
    d.month,
    d.year
  from daily_employee_balance r, dimension_date d
) q1
left outer join daily_employee_balance r
  on (q1.employee_id = r.employee_id) and (q1.iso_date = r.iso_date);
Step 4: Analytics
The query below will give you the true average by month:
select employee_id, monthly_avg, month, year
from (
  select
    employee_id,
    row_number() over (partition by employee_id, year, month) as row_num,
    avg(balance) over (partition by employee_id, year, month) as monthly_avg,
    month,
    year
  from f_employee_balance
) q1
where row_num = 1
order by year, month;
Step 5: Conclusion
You could have just combined steps 3 and 4; this would save you from creating an extra table. In the big data world you don't worry as much about spending extra disk space or development time: you can easily add another disk or node and automate the process using workflows. For more information, look into data warehousing concepts and Hive analytical queries.
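For reference, a sketch of that combined query (same daily_employee_balance and dimension_date tables as above, with no intermediate f_employee_balance table; treat it as illustrative rather than tested):
--fill the gaps and aggregate in a single statement
select employee_id, avg(balance) as monthly_avg, month, year
from (
  select q1.employee_id, q1.iso_date, q1.month, q1.year,
         nvl(last_value(r.balance, true)
             over (partition by q1.employee_id
                   order by q1.iso_date
                   rows between unbounded preceding and current row), 0) as balance
  from (
    select distinct r.employee_id, d.iso_date, d.month, d.year
    from daily_employee_balance r, dimension_date d
  ) q1
  left outer join daily_employee_balance r
    on (q1.employee_id = r.employee_id) and (q1.iso_date = r.iso_date)
) filled
group by employee_id, month, year
order by year, month;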


How should I index a FULLNAME field in Oracle when I need to query by first and last name?

I have a rather large table (34 GB, 77M rows) which contains payment information. The table is partitioned by payment date because users usually care about small ranges of dates so the partition pruning really helps queries to return quickly.
The problem is that I have a user who wants to find out all payments that have ever been made to certain people.
Names are stored in columns NAME1 and NAME2, which are both VARCHAR2(40 Byte) and hold free-form full name data. For example, John Q Public could appear in either column as:
John Q Public
John Public
Public, John Q
or even embedded in the middle of the field, like "Estate of John Public"
Right now, the way the query is set up is to look for
NAME1||NAME2 LIKE '%JOHN%PUBLIC%' OR NAME1||NAME2 LIKE '%PUBLIC%JOHN%' and as you can imagine, the performance sucks.
Is this a job for Oracle Text? How else could I better index the atomic bits of the columns so that the user can search by first/last name?
Database Version: Oracle 12c (12.1.0.2.0)
Create a multi-column index on both names and modify your query to use an INDEX FAST FULL SCAN operation.
Traversing a b-tree index is a great way to quickly find a small amount of data. Unfortunately the leading wildcards ruin that access path for your query. However, Oracle has multiple ways of reading data from an index. The INDEX FAST FULL SCAN operation simply reads all of the index blocks in no particular order, as if the index was a skinny table. Since the average row length of your table is 442 bytes, and the two columns use at most 80 bytes, reading all the names in the index may be much faster than scanning the entire table.
But the index alone probably isn't enough. You need to change the concatenation into multiple OR expressions.
Sample schema:
--Create payment table and index on name columns.
create table payment
(
id number,
paydate date,
other_data varchar2(400),
name1 varchar2(40),
name2 varchar2(40)
);
create index payment_idx on payment(name1, name2);
--Insert 100K sample rows.
insert into payment
select level, sysdate + level, lpad('A', 400, 'A'), level, level
from dual
connect by level <= 100000;
--Insert two rows with relevant values.
insert into payment values(0, sysdate, 'other data', 'B JOHN B PUBLIC B', 'asdf');
insert into payment values(0, sysdate, 'other data', 'asdf', 'C JOHN C PUBLIC C');
commit;
--Gather stats to help optimizer pick the right plan.
begin
dbms_stats.gather_table_stats(user, 'payment');
end;
/
Original expression uses a full table scan:
explain plan for
select name1, name2
from payment
where NAME1||NAME2 LIKE '%JOHN%PUBLIC%' OR NAME1||NAME2 LIKE '%PUBLIC%JOHN%';
select * from table(dbms_xplan.display);
Plan hash value: 684176532
-----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 9750 | 4056K| 1714 (1)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| PAYMENT | 9750 | 4056K| 1714 (1)| 00:00:01 |
-----------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("NAME1"||"NAME2" LIKE '%JOHN%PUBLIC%' OR "NAME1"||"NAME2"
LIKE '%PUBLIC%JOHN%')
New expression uses a faster INDEX FAST FULL SCAN operation:
explain plan for
select name1, name2
from payment
where
NAME1 LIKE '%JOHN%PUBLIC%' OR
NAME1 LIKE '%PUBLIC%JOHN%' OR
NAME2 LIKE '%JOHN%PUBLIC%' OR
NAME2 LIKE '%PUBLIC%JOHN%';
select * from table(dbms_xplan.display);
Plan hash value: 1655289165
------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 18550 | 217K| 152 (3)| 00:00:01 |
|* 1 | INDEX FAST FULL SCAN| PAYMENT_IDX | 18550 | 217K| 152 (3)| 00:00:01 |
------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("NAME1" LIKE '%JOHN%PUBLIC%' AND "NAME1" IS NOT NULL AND
"NAME1" IS NOT NULL OR "NAME1" LIKE '%PUBLIC%JOHN%' AND "NAME1" IS NOT NULL
AND "NAME1" IS NOT NULL OR "NAME2" LIKE '%JOHN%PUBLIC%' AND "NAME2" IS NOT
NULL AND "NAME2" IS NOT NULL OR "NAME2" LIKE '%PUBLIC%JOHN%' AND "NAME2" IS
NOT NULL AND "NAME2" IS NOT NULL)
This solution should definitely be faster than a full table scan. How much faster depends on the average name size and the name being searched. And depending on the query you may want to add additional columns to keep all the relevant data in the index.
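For example, if the user also needs the payment date, a slightly wider index along these lines (the extra columns here are only illustrative) would let the whole query be answered from the index:
--a wider ("covering") index so the query never has to visit the table
create index payment_covering_idx on payment(name1, name2, paydate, id);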
Oracle Text is also a good option, but that feature feels a little "weird" in my opinion. If you're not already using text indexes you might want to stick with normal indexes to simplify administrative tasks.
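That said, if you do want to try Oracle Text, a rough sketch would look like this (it assumes Oracle Text is installed and you have the CTXAPP role; CONTEXT indexes are not maintained transactionally, so they need periodic syncing):
--text indexes on each name column
create index payment_name1_txt on payment(name1) indextype is ctxsys.context;
create index payment_name2_txt on payment(name2) indextype is ctxsys.context;

--token-based search, with no leading-wildcard problem
select name1, name2
from payment
where contains(name1, 'JOHN and PUBLIC') > 0
   or contains(name2, 'JOHN and PUBLIC') > 0;

--keep the indexes in sync after DML
exec ctx_ddl.sync_index('PAYMENT_NAME1_TXT');
exec ctx_ddl.sync_index('PAYMENT_NAME2_TXT');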

Oracle partitioned table query cost vs non-partitioned table query cost

I have a table PO_HEADER with ~20 million records. Considering our future load on the table, we have decided to partition it to increase the performance of the SQL queries. Below are the queries used to create the new partitioned table.
CREATE TABLE PO_HEADER_LP
PARTITION BY LIST (BUYER_IDENTIFIER)
(PARTITION GC66287246AA VALUES ('GC66287246AA') TABLESPACE MITRIX_TABLES,
PARTITION GC43837235JK VALUES ('GC43837235JK') TABLESPACE MITRIX_TABLES,
PARTITION GC84338293AA VALUES ('GC84338293AA') TABLESPACE MITRIX_TABLES,
PARTITION DEFAULTBUID VALUES (DEFAULT) TABLESPACE MITRIX_TABLES)
AS SELECT *
FROM PO_HEADER;
create index PO_HEADER_LP_SI_IDX on PO_HEADER_LP("SUPPLIER_IDENTIFIER") TABLESPACE MITRIX_INDEXES LOCAL;
Old Table PO_HEADER has two indexes on "BUYER_IDENTIFIER" and "SUPPLIER_IDENTIFIER" columns as follows:
create index PO_HEADER_BI_IDX on PO_HEADER("BUYER_IDENTIFIER") TABLESPACE MITRIX_INDEXES;
create index PO_HEADER_SI_IDX on PO_HEADER("SUPPLIER_IDENTIFIER") TABLESPACE MITRIX_INDEXES;
To test the performance of the query, I executed the query below on both tables. But to my surprise, the cost of the 2nd query is almost double that of the 1st one. Does anybody know why the query cost is higher for the partitioned table compared to the normal table? Thanks in advance.
select * from po_header where buyer_identifier='GC84338293AA' and supplier_identifier='GC75987723HT'; --cost: 56,941
select * from po_header_lp where buyer_identifier= 'GC84338293AA' and supplier_identifier='GC75987723HT'; --cost: 93,309
PO_HEADER with Global Index on buyer_identifier & supplier_identifier column
PO_HEADER_LP with Global Index on supplier_identifier column
PO_HEADER_LP with Local Index on supplier_identifier column
From your DDL I assume you have three big buyers (say 5M records each) and a bunch of smaller ones. In other words, this would be the correct setup for your list partitioning scheme.
You may verify whether it works by testing access on the buyer only:
EXPLAIN PLAN SET STATEMENT_ID = 'jara1' into plan_table FOR
select * from tab_lp where BUYER_ID = 1;

SELECT * FROM table(DBMS_XPLAN.DISPLAY('plan_table', 'jara1','ALL'));
------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 6662K| 82M| 4445 (2)| 00:00:01 | | |
| 1 | PARTITION LIST SINGLE| | 6662K| 82M| 4445 (2)| 00:00:01 | KEY | KEY |
| 2 | TABLE ACCESS FULL | TAB_LP | 6662K| 82M| 4445 (2)| 00:00:01 | 2 | 2 |
------------------------------------------------------------------------------------------------
The same query on the non-partitioned table should produce a much higher cost. Why?
In the partitioned table the selected buyer (in your case GC84338293AA; I'm using surrogate keys) has its own partition.
So a full scan of this partition is the best access path.
select * from tab where BUYER_ID = 1;
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 6596K| 81M| 14025 (1)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| TAB | 6596K| 81M| 14025 (1)| 00:00:01 |
--------------------------------------------------------------------------
1 - filter("BUYER_ID"=1)
For the non-partitioned table (to get approximately one fourth of the data) the FULL TABLE SCAN is OK as well,
but of course it has a higher cost, as all the data must be scanned.
Note: if you see a lower cost here, an unrealistically low Rows count and/or INDEX ACCESS,
then this is the cause of the problem: the cost is being underestimated. So don't worry, the old costs are too low, not the new ones too high!
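One simple way to compare the actual work instead of the estimated cost is SQL*Plus autotrace; a small sketch using the buyer-only predicate:
set autotrace traceonly statistics
select * from po_header    where buyer_identifier = 'GC84338293AA';
select * from po_header_lp where buyer_identifier = 'GC84338293AA';
set autotrace off
--compare "consistent gets" and "physical reads" rather than the optimizer cost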
The next step is the access on both buyer and supplier. To get the answer you must provide
additional information.
How selective is the supplier filter?
I.e. if the predicate buyer_identifier='GC84338293AA' returns, say, 5M records, how many records does the predicate with both columns return?
buyer_identifier='GC84338293AA' and supplier_identifier='GC75987723HT'
Is it 4M or 100 records?
If the complete predicate returns only a few records, then the local index on supplier is OK.
If it returns a large number of rows (say a quarter of the partition), you should stay with the FULL PARTITION SCAN and not use it.
This is similar to my comment on the non-partitioned table.
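A quick way to answer that selectivity question against your own data is to count both predicates (a sketch, reusing the literals from your query):
--rows for the buyer alone
select count(*) from po_header_lp
where buyer_identifier = 'GC84338293AA';

--rows for the buyer plus supplier
select count(*) from po_header_lp
where buyer_identifier = 'GC84338293AA'
  and supplier_identifier = 'GC75987723HT';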
Estimation of the supplier cardinality
In case the column SUPPLIER contains skewed data (which may fool the CBO into calculating an improper cost), you may explicitly define a histogram on this column.
I used the statement below, which calculates the histogram on the full data (100% is important for highly skewed data), for both the table and the partitions.
exec dbms_stats.gather_table_stats(ownname=>user,tabname=>'TAB_LP',granularity=>'all',estimate_percent => 100,METHOD_OPT => 'for columns SUPPLIER_ID size 254');
This worked for my test data, i.e. for suppliers with low cardinality an index access was used (on the local non-prefixed index) and for huge suppliers a full partition scan was used.
You can create a Local partitioned index using this script.
CREATE INDEX PO_HEADER_LOCAL_IDX ON PO_HEADER_LP
(BUYER_IDENTIFIER, SUPPLIER_IDENTIFIER)
LOCAL (
PARTITION GC66287246AA,
PARTITION GC43837235JK,
PARTITION GC84338293AA,
PARTITION DEFAULTBUID
);
It is also recommended to gather statistics on the newly created partitioned table using this script:
EXEC DBMS_STATS.GATHER_TABLE_STATS('SCHEMA Name','PO_HEADER_LP');
Now you can generate the execution plan again of the following SQL:
select * from po_header_lp where buyer_identifier= 'GC84338293AA' and supplier_identifier='GC75987723HT';
Hope this will help you.

Oracle PLSQL: Performance issue when using TABLE functions

I am currently facing a performance problem when using the table functions. I will explain.
I am working with Oracle types and one of them is defined like below:
create or replace TYPE TYPESTRUCTURE AS OBJECT
(
ATTR1 VARCHAR2(30),
ATTR2 VARCHAR2(20),
ATTR3 VARCHAR2(20),
ATTR4 VARCHAR2(20),
ATTR5 VARCHAR2(20),
ATTR6 VARCHAR2(20),
ATTR7 VARCHAR2(20),
ATTR8 VARCHAR2(20),
ATTR9 VARCHAR2(20),
ATTR10 VARCHAR2(20),
ATTR11 VARCHAR2(20),
ATTR12 VARCHAR2(20),
ATTR13 VARCHAR2(10),
ATTR14 VARCHAR2(50),
ATTR15 VARCHAR2(13)
);
Then I have one table of this type like:
create or replace TYPE TYPESTRUCTURE_ARRAY AS TABLE OF TYPESTRUCTURE ;
I have one procedure with the following variables:
arr TYPESTRUCTURE_ARRAY;
arr2 TYPESTRUCTURE_ARRAY;
ARR contains only a single instance of TYPESTRUCTURE, with all its attributes set to NULL except ATTR4, which is set to 'ABC'.
ARR2 is completely empty.
Here comes the part which is giving me the performance issue.
The purpose is to take some values from a view (depending on the value of ATTR4) and fill them into the same or a similar structure. So I do the following:
SELECT TYPESTRUCTURE(MV.A,null,null,MV.B,MV.C,MV.D,null,null,MV.E,null,null,MV.F,MV.F,MV.G,MV.H)
BULK COLLECT INTO arr2
FROM TABLE(arr) PARS
JOIN MYVIEW MV
ON MV.B = PARS.ATTR4;
The code here works correctly except for the fact that it takes 15 seconds to execute the query...
This query fills around 20 instances of TYPESTRUCTURE (i.e. rows) into ARR2.
It could look like there is simply a lot of data in the view. But what strikes me as strange is that if I change the query and hardcode the value, as below, then it is completely fast (milliseconds):
SELECT TYPESTRUCTURE(MV.A,null,null,MV.B,MV.C,MV.D,null,null,MV.E,null,null,MV.F,MV.F,MV.G,MV.H)
BULK COLLECT INTO arr2
FROM (SELECT 'ABC' ATTR4 FROM DUAL) PARS
JOIN MYVIEW MV
ON MV.B = PARS.ATTR4;
In this new query I am directly hardcoding the value but keeping the join, in order to test something as similar as possible to the query above but without the TABLE() function.
So here is my question: is it possible that this TABLE() function is creating such a big delay with only a single record inside? I would like to know whether someone can give me some advice on what is wrong in my approach and whether there may be some other way to achieve this...
Thanks!!
This problem is likely caused by a poor optimizer estimate for the number of rows returned by the TABLE function. The CARDINALITY or DYNAMIC_SAMPLING hints may be the best way to solve the problem.
Cardinality estimate
Oracle gathers statistics on tables and indexes in order to estimate the cost of accessing those objects. The most important estimate is how many rows will be returned by an object. Procedural code does not have statistics, by default, and Oracle does not make any attempt to parse the code and estimate how many rows will be produced. Whenever Oracle sees a procedural row source it uses a static number. On my database, the number is 16360. On most databases the estimate is 8192, as beherenow pointed out.
explain plan for
select * from table(sys.odcinumberlist(1,2,3));
select * from table(dbms_xplan.display(format => 'basic +rows'));
Plan hash value: 2234210431
--------------------------------------------------------------
| Id | Operation | Name | Rows |
--------------------------------------------------------------
| 0 | SELECT STATEMENT | | 16360 |
| 1 | COLLECTION ITERATOR CONSTRUCTOR FETCH| | 16360 |
--------------------------------------------------------------
Fix #1: CARDINALITY hint
As beherenow suggested, the CARDINALITY hint can solve this problem by statically telling Oracle how many rows to estimate.
explain plan for
select /*+ cardinality(1) */ * from table(sys.odcinumberlist(1,2,3));
select * from table(dbms_xplan.display(format => 'basic +rows'));
Plan hash value: 2234210431
--------------------------------------------------------------
| Id | Operation | Name | Rows |
--------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 |
| 1 | COLLECTION ITERATOR CONSTRUCTOR FETCH| | 16360 |
--------------------------------------------------------------
Fix #2: DYNAMIC_SAMPLING hint
A more "official" solution is to use the DYNAMIC_SAMPLING hint. This hint tells Oracle to sample some data at run time before it builds the explain plan. This adds some cost to building the explain plan, but it will return the true number of rows. This may work much better if you don't know the number ahead of time.
explain plan for
select /*+ dynamic_sampling(2) */ * from table(sys.odcinumberlist(1,2,3));
select * from table(dbms_xplan.display(format => 'basic +rows'));
Plan hash value: 2234210431
--------------------------------------------------------------
| Id | Operation | Name | Rows |
--------------------------------------------------------------
| 0 | SELECT STATEMENT | | 3 |
| 1 | COLLECTION ITERATOR CONSTRUCTOR FETCH| | 3 |
--------------------------------------------------------------
But what's really slow?
We don't know exactly what was slow in your query. But whenever things are slow, it's usually best to focus on the worst cardinality estimate. Row estimates are never perfect, but being off by several orders of magnitude can have a huge impact on an execution plan. In the simplest case it may change an index range scan into a full table scan.
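Applied to the query from the question, the first fix might look like this (just a sketch; the alias and collection names are the ones from the question):
SELECT /*+ cardinality(pars 1) */
       TYPESTRUCTURE(MV.A, null, null, MV.B, MV.C, MV.D, null, null,
                     MV.E, null, null, MV.F, MV.F, MV.G, MV.H)
BULK COLLECT INTO arr2
FROM TABLE(arr) PARS
JOIN MYVIEW MV
  ON MV.B = PARS.ATTR4;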

Hive: SemanticException [Error 10002]: Line 3:21 Invalid column reference 'name'

I am using the following Hive query script with version 0.13.0:
DROP TABLE IF EXISTS movies.movierating;
DROP TABLE IF EXISTS movies.list;
DROP TABLE IF EXISTS movies.rating;
DROP DATABASE IF EXISTS movies;
ADD JAR /usr/local/hadoop/hive/hive/lib/RegexLoader.jar;
CREATE DATABASE IF NOT EXISTS movies;
CREATE EXTERNAL TABLE IF NOT EXISTS movies.list (id STRING, name STRING, genre STRING)
ROW FORMAT SERDE 'com.cisco.hadoop.loaders.RegexSerDe' with SERDEPROPERTIES(
"input.regex"="^(.*)\\:\\:(.*)\\:\\:(.*)$",
"output.format.string"="%1$s %2$s %3$s");
CREATE EXTERNAL TABLE IF NOT EXISTS movies.rating (id STRING, userid STRING, rating STRING, timestamp STRING)
ROW FORMAT SERDE 'com.cisco.hadoop.loaders.RegexSerDe'
with SERDEPROPERTIES(
"input.regex"="^(.*)\\:\\:(.*)\\:\\:(.*)\\:\\:(.*)$",
"output.format.string"="%1$s %2$s %3$s %4$s");
LOAD DATA LOCAL INPATH 'ml-10M100K/movies.dat' into TABLE movies.list;
LOAD DATA LOCAL INPATH 'ml-10M100K/ratings.dat' into TABLE movies.rating;
CREATE TABLE movies.movierating(id STRING, name STRING, genre STRING, rating STRING);
INSERT OVERWRITE TABLE movies.movierating
SELECT list.id, list.name, list.genre, rating.rating from movies.list list LEFT JOIN movies.rating rating ON (list.id=rating.id) GROUP BY list.id;
The issue is when I execute the script without the "GROUP BY" clause it works fine.
But when I execute it with the "GROUP BY" clause, I get the following error
FAILED: SemanticException [Error 10002]: Line 4:21 Invalid column reference 'name'
Any ideas what is happening here?
Appreciate your help
Thanks!
If you group by a column, your select statement can only select a) that column, b) columns derived only from that column, or c) a UDAF applied to other columns.
In this case, you're only grouping by list.id, so when you try to select list.name, that's invalid. Think about it this way: what if your list table contained the following two entries:
id|name |genre
--+-----+------
01|name1|comedy
01|name2|horror
What would you expect this query to return:
select list.id, list.name, list.genre from list group by list.id;
In this case it's nonsensical. I'm guessing that id in reality is a primary key, but note that hive does not know this, so the above data set is perfectly valid.
With all that in mind, it's not clear to me how to fix it because I don't know the desired output. For example, let's say without the group by (just the join), you have as output:
id|name |genre |rating
--+-----+------+-------
01|name1|comedy|'pretty good'
01|name1|comedy|'bad'
02|name2|horror|'9/10'
03|name3|action|NULL
What would you want the output to be with the group by? What are you trying to accomplish by doing the group by?
OK let me see if I can ask this in a better way.
Here are my two tables
Movies list table - Consists of movies information
ID | Movie Name | Genre
1 | Movie 1 | comedy
2 | movie 2 | action
3 | movie 3 | thriller
And I have ratings table
MOVIE_ID | USER ID | RATING on 5 | TIMESTAMP
1 | xyz | 5 | 12345612
1 | abc | 4 | 23232312
2 | zvc | 1 | 12321123
2 | zyx | 2 | 12312312
What I would like to do is get the output in the following way:
Movie ID | Movie Name | Genre | Rating Average
1 | Movie 1 | comedy | 4.5
2 | Movie 2 | action | 1.5
I am not a DB expert, but my understanding is this: when you group the data together, you either need to reduce the multiple values to a scalar value, or all the values (if strings) should be the same, right?
For example, in my previous case I was grouping them together as strings. That is okay for list.id, list.name and list.genre, but list.rating is always going to give some problem here (I just learnt Pig along with Hive, and grouping works differently there).
So to tackle the problem, I cast the rating, averaged it out and stored it in a FLOAT column. Have a look at my code below:
CREATE TABLE movies.movierating(id STRING, name STRING, genre STRING, rating FLOAT);
INSERT OVERWRITE TABLE movies.movierating
SELECT list.id, list.name, list.genre, AVG(cast(rating.rating as FLOAT))
from movies.list list
LEFT JOIN movies.rating rating ON (list.id = rating.id)
GROUP BY list.id, list.name, list.genre
order by list.id DESC;
Thank you for your explanation. I might save the following question for the next thread, but here is my observation:
the overall job performs worse when doing the grouping and joining together than when doing them in two separate steps. For the same job, I changed the code a bit to perform the grouping first and then join the data, and the overall time was reduced by 40 seconds: earlier it was taking 140 seconds and now it is taking 100 seconds. Any reason for that?
Once again thank you for your explanation.
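For what it's worth, a sketch of that 'aggregate first, then join' variant (same movies.list and movies.rating tables as above) could look like this; one plausible reason it runs faster is that the join then processes one row per movie instead of one row per rating:
INSERT OVERWRITE TABLE movies.movierating
SELECT list.id, list.name, list.genre, r.avg_rating
FROM movies.list list
LEFT JOIN (
  SELECT id, AVG(CAST(rating AS FLOAT)) AS avg_rating
  FROM movies.rating
  GROUP BY id
) r ON (list.id = r.id);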
I came across the same issue:
org.apache.hadoop.hive.ql.parse.SemanticException: Invalid column reference "charge_province"
After I put "charge_province" in the group by, the issue was gone. I don't know why (presumably for the reason explained above: every non-aggregated column in the select has to appear in the group by).

How should I range partition an index with a varchar2 column in Oracle? Is it a bad idea?

I am using Oracle 10g Enterprise edition.
A table in our Oracle database stores the soundex value representation of another text column. We are using a custom soundex implementation in which the soundex values are longer than those generated by traditional soundex algorithms (such as the one Oracle uses). That's really beside the point.
Basically I have a varchar2 column that has values containing a single character followed by a dynamic number of numeric values (e.g. 'A12345', 'S382771', etc). The table is partitioned by another column, but I'd like to add a partitioned index to the soundex column since it is often searched. When trying to add a range partitioned index using the first character of the soundex column it worked great:
create index IDX_NAMES_SOUNDEX on NAMES_SOUNDEX (soundex)
global partition by range (soundex) (
partition IDX_NAMES_SOUNDEX_PART_A values less than ('B'), -- 'A%'
partition IDX_NAMES_SOUNDEX_PART_B values less than ('C'), -- 'B%'
...
);
However, in order to more evenly distribute the size of the partitions, I want to define some partitions by the first two characters, like so:
create index IDX_NAMES_SOUNDEX on NAMES_SOUNDEX (soundex)
global partition by range (soundex) (
partition IDX_NAMES_SOUNDEX_PART_A5 values less than ('A5'), -- 'A0% - A4%'
partition IDX_NAMES_SOUNDEX_PART_A values less than ('B'), -- 'A5% - A9%'
partition IDX_NAMES_SOUNDEX_PART_B values less than ('C'), -- 'B%'
...
);
I'm not sure how to properly range partition using varchar2 columns. I'm sure this is a less than ideal choice, so perhaps someone can recommend a better solution. Here's a distribution of the soundex data in my table:
---------------------------------
| SUBSTR(SOUNDEX,1,1) | COUNT   |
---------------------------------
| A                   | 6476349 |
| B                   |  854880 |
| D                   |  520676 |
| F                   | 1200045 |
| G                   |  280647 |
| H                   | 3048637 |
| J                   |  711031 |
| K                   | 1336522 |
| L                   |  348743 |
| M                   | 3259464 |
| N                   | 1510070 |
| Q                   |  276769 |
| R                   | 1263008 |
| S                   | 3396223 |
| V                   |  533844 |
| W                   |  555007 |
| Y                   |  348504 |
| Z                   | 1079179 |
---------------------------------
As you can see, the distribution is not evenly spread, which is why I want to define range partitions using the first two characters instead of just the first character.
Suggestions?
Thanks!
What exactly is your question? Is it that you don't know how to split your table into n equal parts to avoid skew?
You can do that with the analytic function percentile_disc().
Here is a SQL*Plus example with n=100; I admit it isn't very sophisticated, but it will do the job.
set pages 0
set lines 200
drop table random_strings;
create table random_strings
as
select upper(dbms_random.string('A', 12)) rndmstr
from dual
connect by level < 1000;
spool parts
select 'select '||level||'/100,percentile_disc('||level||
'/100) within group (order by RNDMSTR) from random_strings;'
sql_statement
from dual
connect by level <= 100
/
spool off
This will output in file parts.lst:
select 1/100,percentile_disc(1/100) within group (order by RNDMSTR) from random_strings;
select 2/100,percentile_disc(2/100) within group (order by RNDMSTR) from random_strings;
select 3/100,percentile_disc(3/100) within group (order by RNDMSTR) from random_strings;
...
select 100/100,percentile_disc(100/100) within group (order by RNDMSTR) from random_strings;
Now you can run script parts.lst to get the partition values. Each partition will contain 1% of the data initially.
Script parts.lst will output:
,01 AJUDRRSPGMNP
,02 AOMJZQPZASQZ
,03 AWDQXVGLLUSJ
,04 BIEPUHAEMELR
....
,99 ZTMHDWTXUJAR
1 ZYVJLNATVLOY
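The boundary values from the spool can then be plugged into the index DDL; a rough sketch (the partition names and the boundary strings shown are just the illustrative values above):
create index IDX_NAMES_SOUNDEX on NAMES_SOUNDEX (soundex)
global partition by range (soundex) (
  partition P001 values less than ('AJUDRRSPGMNP'),
  partition P002 values less than ('AOMJZQPZASQZ'),
  -- ... one partition per percentile boundary ...
  partition P100 values less than (MAXVALUE)
);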
Is the table being searched by the partitioning key in addition to the SOUNDEX value? Or is it being searched just by the SOUNDEX column?
If you are just trying to achieve an even distribution of data among partitions, have you considered using hash partitions rather than range partitions? Assuming you choose a power of 2 for the number of partitions, that should give you a pretty even distribution of data between partitions.
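A hash-partitioned global index along these lines (16 partitions is just an example; pick a power of two that suits your volume) would spread the soundex values evenly without hand-picking boundaries:
create index IDX_NAMES_SOUNDEX on NAMES_SOUNDEX (soundex)
global partition by hash (soundex)
partitions 16;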
Talk to me!
Can you tell me what your reason is for partitioning this table? It sounds like it is an OLTP table and may not need to be partitioned. We don't want to partition just to say we are partitioned. Tell me what you are trying to accomplish by partitioning this table and I can help you pick a correct partitioning scheme. Partitioning does not equal faster queries; it can actually cause your queries to be slower in some cases.
I see some of your additional thoughts above, and I don't believe you need to partition your table. If your queries are going to be doing aggregates on entire partitions, then you may want to partition. If you are going to have hundreds of millions of rows of data, you may want to partition to help with DBA maintenance. If you just want your queries to run fast, then the primary key index will suffice. Please let me know.
Just create a global index on your desired columns.
