Hive select from table as complex type - hadoop

Consider a base table employee and a table derived from it, employee_salary_period, which contains a complex map datatype. How do I select and insert data from employee into employee_salary_period, where salary_period_map is a key-value pair, i.e. salary: period?
CREATE TABLE employee(
emp_id bigint,
name string,
address string,
salary double,
period string,
position string
)
PARTITIONED BY (
dept_id bigint)
STORED AS PARQUET;
CREATE TABLE employee_salary_period(
emp_id bigint,
name string,
salary string,
period string,
salary_period_map Map<String,String>
)
PARTITIONED BY (
dept_id bigint)
STORED AS PARQUET;
I'm stuck trying to figure out how to select the data as salary_period_map.

Consider using the str_to_map function provided by Hive. I assume you have only one key (salary) in your map:
select
emp_id,
name,
salary,
period,
str_to_map(concat(salary,":",period),'&',':') as salary_period_map
from employee
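To load employee_salary_period from employee in one statement, the same expression can be used in an INSERT ... SELECT. A minimal sketch, assuming dynamic partitioning on dept_id (the two SET switches are the standard Hive settings for that; the cast makes the double-to-string conversion explicit):
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE employee_salary_period PARTITION (dept_id)
select
emp_id,
name,
cast(salary as string) as salary,
period,
str_to_map(concat(salary,":",period),'&',':') as salary_period_map,
dept_id
from employee;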

clickhouse MATERIALIZED VIEW issues

I created a MATERIALIZED VIEW like this:
create target table:
CREATE TABLE user_deatils_daily (
day date,
hour UInt8,
appid UInt32,
isp String,
city String,
country String,
session_count UInt64,
avg_score AggregateFunction(avg, Float32),
min_revenue AggregateFunction(min, Float32),
max_load_time AggregateFunction(max, Int32)
)
ENGINE = SummingMergeTree()
PARTITION BY toRelativeWeekNum(day)
ORDER BY (day,hour)
create mv:
CREATE MATERIALIZED VIEW user_deatils_daily_mv
TO user_deatils_daily as
select toDate(session_ts) as day, toHour(toDateTime(session_ts)) as hour,
appid, isp, city, country,
count(session_uuid) as session_count,
avgState() as avg_score,
minState(revenue) as min_revenue,
maxState(perf_page_load_time) as max_load_time
from user_deatils
where toDate(session_ts)>='2020-08-26'
group by session_ts,appid,isp,city,country
The target table starts to fill with data, but after some time only the new data remains and the old rows are gone.
Why is that?
SummingMergeTree() PARTITION BY toRelativeWeekNum(day) ORDER BY (day,hour)
means: calculate sums grouped by (day,hour) within each toRelativeWeekNum(day) partition. That is why the old rows seem to disappear: during merges they are summed into a single row per (day,hour).
user_deatils_daily knows nothing about user_deatils_daily_mv; they are not related. user_deatils_daily_mv just does inserts into user_deatils_daily, and SummingMergeTree knows nothing about the view's group by session_ts,appid,isp,city,country.
I would expect to see ORDER BY (ts,appid,isp,city,country);
I would do:
CREATE TABLE user_details_daily
( ts DateTime,
appid UInt32,
isp String,
city String,
country String,
session_count SimpleAggregateFunction(sum,UInt64),
avg_score AggregateFunction(avg, Float32),
min_revenue SimpleAggregateFunction(min, Float32),
max_load_time SimpleAggregateFunction(max, Int32) )
ENGINE = AggregatingMergeTree()
PARTITION BY toStartOfWeek(ts)
ORDER BY (ts,appid,isp,city,country);
CREATE MATERIALIZED VIEW user_deatils_daily_mv TO user_details_daily
as select
toStartOfHour(toDateTime(session_ts)) ts,
appid,
isp,
city,
country,
count(session_uuid) as session_count ,
avgState() as avg_score,
min(revenue) as min_revenue,
max(perf_page_load_time) as max_load_time
from user_deatils
where toDate(session_ts)>='2020-08-26' group by ts,appid,isp,city,country;
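To read the aggregate states back out of user_details_daily, the AggregateFunction column has to be finalized with the -Merge combinator, while the SimpleAggregateFunction columns can be re-aggregated directly. A minimal sketch of such a query (the grouping keys here are just an example):
select
ts,
appid,
sum(session_count) as session_count,
avgMerge(avg_score) as avg_score,
min(min_revenue) as min_revenue,
max(max_load_time) as max_load_time
from user_details_daily
group by ts, appid;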

Hive Partition By dynamic value in s3 file name

Assuming an S3 location with required data is of the form:
s3://stack-overflow-example/v1/
where each file title in v1/ is of the form
francesco_{YYY_DD_MM_HH}_totti.csv
and each csv file contains a unix timestamp as a column in each row.
Is it possible to create an external hive table partitioned by the {YYY_DD_MM_HH} in each file name without first creating an unpartitioned table?
I have tried the below:
create external table so_test
(
a int,
b int,
unixtimestamp string
)
PARTITIONED BY (
from_unixtime(CAST(unixtimestamp/1000 as BIGINT), 'yyyy-MM-dd') string
)
LOCATION 's3://stack-overflow-example/v1'
but this fails.
An option that should work is creating an unpartitioned table like the below:
create external table so_test
(
a int,
b int,
unixtimestamp string
)
LOCATION 's3://stack-overflow-example/v1';
and then dynamically inserting into a partitioned table:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
create external table so_test_partitioned
(
a int,
b int,
unixtimestamp string
)
PARTITIONED BY (
datep string
)
LOCATION 's3://stack-overflow-example/v1';
INSERT OVERWRITE TABLE so_test_partitioned PARTITION (datep)
select
a,
b,
unixtimestamp,
from_unixtime(CAST(unixtimestamp/1000 as BIGINT), 'yyyy-MM-dd') as datep
from so_test;
Is creating an unpartitioned table first the only way?
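For background: Hive derives partitions from key=value subdirectories under the table LOCATION, not from file names, so a flat directory of CSVs cannot be partitioned in place; that is why the two-step insert above (or reorganizing the files) is needed. If the files could be moved into such subdirectories, each partition could be attached without rewriting the data. A hypothetical sketch (the layout and date value are illustrative):
-- after moving files into s3://stack-overflow-example/v1/datep=2017-10-01/ etc.
ALTER TABLE so_test_partitioned ADD PARTITION (datep='2017-10-01')
LOCATION 's3://stack-overflow-example/v1/datep=2017-10-01/';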

How to get the value of clob data which has date field and compare with timestamp in oracle

I have a table named master_registry; it contains all the info and business logic, and the datatype of the value column is CLOB.
desc master_registry:
id number not null,
name varchar2(100),
value clob
select value from master_registry where name='REG_DATE';
output:
11-10-17
This date is common across all the business logic. I need to query my table, which has:
desc get_employee
====================
id number not null,
first_name varchar2(100),
last_name varchar2(100),
last_mod_dt timestamp
Now I need to get all the rows from get_employee whose last_mod_dt is greater than the value from master_registry where name='REG_DATE'. The value in the latter table is CLOB data; how do I fetch and compare the date inside a CLOB against the timestamp column of another table? Please help.
Maybe you need something like this.
SELECT *
FROM get_employee e
WHERE last_mod_dt > (SELECT TO_TIMESTAMP (TO_CHAR (VALUE), 'DD-MM-YY')
FROM master_registry m
WHERE m.id = e.id);
Note that I have used the column value directly in TO_CHAR. You may have to use TRIM, SUBSTR, or whatever is required to get only the date component.
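For example, if the CLOB holds anything besides the eight date characters, one hypothetical variant is DBMS_LOB.SUBSTR (amount 8, offset 1), which extracts just the date part and returns a VARCHAR2 directly:
SELECT *
FROM get_employee e
WHERE e.last_mod_dt > (SELECT TO_TIMESTAMP (TRIM (DBMS_LOB.SUBSTR (m.value, 8, 1)), 'DD-MM-YY')
FROM master_registry m
WHERE m.id = e.id);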

Insert data of 2 Hive external tables in new External table with additional column

I have two external Hive tables, as follows. I populated them from Oracle using Sqoop.
create external table transaction_usa
(
tran_id int,
acct_id int,
tran_date string,
amount double,
description string,
branch_code string,
tran_state string,
tran_city string,
speendby string,
tran_zip int
)
row format delimited
stored as textfile
location '/user/stg/bank_stg/tran_usa';
create external table transaction_canada
(
tran_id int,
acct_id int,
tran_date string,
amount double,
description string,
branch_code string,
tran_state string,
tran_city string,
speendby string,
tran_zip int
)
row format delimited
stored as textfile
location '/user/stg/bank_stg/tran_canada';
Now I want to merge the data of the above two tables, as-is, into one external Hive table with all the same fields, plus one extra column, source_table, identifying which table each row came from. The new external table is as follows.
create external table transaction_usa_canada
(
tran_id int,
acct_id int,
tran_date string,
amount double,
description string,
branch_code string,
tran_state string,
tran_city string,
speendby string,
tran_zip int,
source_table string
)
row format delimited
stored as textfile
location '/user/gds/bank_ds/tran_usa_canada';
How can I do it?
You SELECT from each table, perform a UNION ALL on the results, and finally insert the result into your third table.
Below is the final Hive query:
INSERT INTO TABLE transaction_usa_canada
SELECT tran_id, acct_id, tran_date, amount, description, branch_code, tran_state, tran_city, speendby, tran_zip, 'transaction_usa' AS source_table FROM transaction_usa
UNION ALL
SELECT tran_id, acct_id, tran_date, amount, description, branch_code, tran_state, tran_city, speendby, tran_zip, 'transaction_canada' AS source_table FROM transaction_canada;
Hope this helps!
You can very well do it by manual partitioning as well.
CREATE TABLE transaction_new_table (
tran_id int,
acct_id int,
tran_date string,
amount double,
description string,
branch_code string,
tran_state string,
tran_city string,
speendby string,
tran_zip int
)
PARTITIONED BY (sourcetablename String)
Then run the below command:
load data inpath 'hdfspath' into table transaction_new_table partition(sourcetablename='1')
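The second table's files then go into their own partition value, and the partition pseudo-column tells the rows apart at query time; a hypothetical follow-up ('hdfspath' stays a placeholder):
load data inpath 'hdfspath' into table transaction_new_table partition(sourcetablename='2');
select * from transaction_new_table where sourcetablename = '1';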
You could use the INSERT INTO clause of Hive:
INSERT INTO TABLE transaction_usa_canada
SELECT tran_id, acct_id, tran_date, ...'transaction_usa' FROM transaction_usa;
INSERT INTO TABLE transaction_usa_canada
SELECT tran_id, acct_id, tran_date, ...'transaction_canada' FROM transaction_canada;

Hive partitions on tables

When we partition a table, the columns on which it is being partitioned are not mentioned in the create statement's column list but are listed separately in PARTITIONED BY. What is the reason behind this?
CREATE TABLE REGISTRATION_DATA (
userid BIGINT,
First_Name STRING,
Last_Name STRING,
address1 STRING,
address2 STRING,
city STRING,
zip_code STRING,
state STRING
)
PARTITIONED BY (
REGION STRING,
COUNTRY STRING
)
A partition column in Hive becomes a pseudo-column that you can query directly even though it is not part of the column list in the create statement.
So when we also include the partition column in the table's own column list (the create query), we get an error like 'Error in semantic analysis: Columns repeated in partitioning columns'.
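A minimal sketch of both sides of this, reusing the columns from the question (the filter values are made up): the partition columns can be queried like ordinary columns, and repeating one of them in the column list raises the semantic-analysis error.
SELECT userid, city FROM registration_data WHERE country = 'US' AND region = 'WEST';
-- fails with 'Columns repeated in partitioning columns':
CREATE TABLE registration_data_bad (
userid BIGINT,
region STRING
)
PARTITIONED BY (region STRING);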
