Is expression-based partitioning supported in Hive? - hadoop

I have a table with a column. Can I create a partition based on an expression that uses that column?
I read that IBM's Big SQL technology has this feature.
I know Hive can partition by a column, but what about an expression?
In this case I am doing a cast, but it could be any expression:
CREATE TABLE INVENTORY_A (
trans_id int,
product varchar(50),
trans_ts timestamp
)
PARTITIONED BY (
cast(trans_ts as date) AS date_part
)
I expect the records to be partitioned by the date value, so that when a user writes a query like
select * from INVENTORY_A where trans_ts BETWEEN timestamp '2016-06-23 14:00:00.000' AND timestamp '2016-06-23 14:59:59.000'
the query is smart enough to derive the date from the timestamp and read only the matching date partition.

You can use dynamic partitioning and apply the cast in the SELECT query that loads the table.
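A minimal sketch of that suggestion, assuming the raw rows sit in a staging table (inventory_staging is a made-up name) and the partition column is declared as a plain DATE column called date_part. As far as I know Hive will not derive a partition filter from the trans_ts predicate by itself, so the final query filters on date_part explicitly.

```sql
-- Hypothetical staging table holding the unpartitioned rows.
CREATE TABLE inventory_staging (
  trans_id INT,
  product  VARCHAR(50),
  trans_ts TIMESTAMP
);

-- Target table: the partition column is an ordinary column, not an expression.
CREATE TABLE inventory_a (
  trans_id INT,
  product  VARCHAR(50),
  trans_ts TIMESTAMP
)
PARTITIONED BY (date_part DATE);

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The expression is evaluated at load time; its value picks the partition.
INSERT INTO TABLE inventory_a PARTITION (date_part)
SELECT trans_id, product, trans_ts, CAST(trans_ts AS DATE) AS date_part
FROM inventory_staging;

-- Filter on the partition column as well so only one partition is read.
SELECT *
FROM inventory_a
WHERE date_part = DATE '2016-06-23'
  AND trans_ts BETWEEN TIMESTAMP '2016-06-23 14:00:00.000'
                   AND TIMESTAMP '2016-06-23 14:59:59.000';
```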

Related

Count date strings between a range of dates

I have a Hive table (table_1). One of its columns is called 'date'. Its values are of type string in the format 'yyyyMMdd' (e.g. 20210102). I am trying to get the count(*) of records within a range of dates in that column.
For example: select count(*) from table_1 where date BETWEEN 20210101 AND 20210301. This does not work because the column is of type string. I need help querying that column as a DATE.
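Two possible ways to write this, as a sketch rather than a tested answer. The first relies on the fact that zero-padded 'yyyyMMdd' strings sort in date order, so the bounds can be given as strings. The second converts the string with unix_timestamp, from_unixtime and to_date before comparing. The column name is quoted with backticks because date is also a type name.

```sql
-- Option 1: compare as strings; 'yyyyMMdd' values sort chronologically.
SELECT COUNT(*)
FROM table_1
WHERE `date` BETWEEN '20210101' AND '20210301';

-- Option 2: parse the string into a date first, then compare.
SELECT COUNT(*)
FROM table_1
WHERE to_date(from_unixtime(unix_timestamp(`date`, 'yyyyMMdd')))
      BETWEEN '2021-01-01' AND '2021-03-01';
```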

Is there any required order of columns when creating a Hive table that needs to be partitioned dynamically?

I am trying to load an RDBMS table into Hive. I need to partition the table dynamically based on a column's data. The schema of the Greenplum table is as follows:
forecast_id:bigint
period_year:numeric(15,0)
period_num:numeric(15,0)
period_name:character varying(15)
drm_org:character varying(10)
ledger_id:bigint
currency_code:character varying(15)
source_system_name:character varying(30)
source_record_type:character varying(30)
xx_last_update_log_id:integer
xx_data_hash_code:character varying(32)
xx_data_hash_id:bigint
xx_pk_id:bigint
When I checked the schema of the same table in Hive (where it is usually replicated) with describe extended tablename, I got the schema below:
forecast_id bigint
period_year bigint
period_num bigint
period_name string
drm_org string
ledger_id bigint
currency_code string
source_record_type string
xx_last_update_log_id int
xx_data_hash_code string
xx_data_hash_id bigint
xx_pk_id bigint
source_system_name String
So I asked my lead why the column source_system_name appears at the end of the Hive table, and the answer was: "The columns that are used to partition the Hive table dynamically come at the end of the table."
Is it true that the columns on which a Hive table is dynamically partitioned must come at the end of the schema?
The order of the columns matters when you use dynamic partitioning in Hive. From the Hive documentation on dynamic partition inserts:
In INSERT ... SELECT ... queries, the dynamic partition columns must
be specified last among the columns in the SELECT statement and in the
same order in which they appear in the PARTITION() clause.
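Applied to the schema in the question, the rule could look like the sketch below. The table names forecast_hive and forecast_staging are made up for illustration; the point is only that source_system_name, the dynamic partition column, comes last in the SELECT list.

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- forecast_hive is assumed to be PARTITIONED BY (source_system_name STRING).
INSERT INTO TABLE forecast_hive PARTITION (source_system_name)
SELECT
  forecast_id,
  period_year,
  period_num,
  period_name,
  drm_org,
  ledger_id,
  currency_code,
  source_record_type,
  xx_last_update_log_id,
  xx_data_hash_code,
  xx_data_hash_id,
  xx_pk_id,
  source_system_name   -- dynamic partition column goes last
FROM forecast_staging;
```

Hive matches the trailing SELECT columns to the partition columns by position, not by name, which is why the order matters.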

Hive partition table by month from daily timestamp?

Is it possible to create a partition like 01 from a date like '2017-01-02', where 01 is the month?
I have daily sales records and I need to run queries like select * from sales where month = '01', so it would be better to partition my daily sales by month. But my data has dates in the format 2017-01-01, and doing
create table tl (columns ......) partitioned by (date <datatype> ) will create a partition per day, which is the last thing I want.
I need to create the partitions dynamically.
CAUTION: You need to escape the date column in the create statement by putting backticks (`) around the column name, because date is a data type in Hive.
You can create partitions dynamically by setting the parameter below before the query:
set hive.exec.dynamic.partition.mode=nonstrict;
Along with that, select only the month part from the source table. With dates in the format 2017-01-01, the month starts at position 6:
insert into table sales partition(`date`) select columns...,SUBSTR(`date`,6,2) from source_table
This insert statement will create partitions like:
show partitions sales
date=01
date=02
date=03
date=04
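For a fuller picture, here is a sketch that names the partition column `month` so the query from the question works as written. The data columns (sale_id, amount, sale_date) are invented for the example, and source_table is assumed to hold sale_date as a 'yyyy-MM-dd' string.

```sql
-- Hypothetical sales table partitioned by a two-character month string.
CREATE TABLE sales (
  sale_id   INT,
  amount    DOUBLE,
  sale_date STRING
)
PARTITIONED BY (`month` STRING);

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- substr is 1-based: characters 6-7 of '2017-01-02' are '01'.
INSERT INTO TABLE sales PARTITION (`month`)
SELECT sale_id, amount, sale_date, substr(sale_date, 6, 2) AS `month`
FROM source_table;

-- Reads only the month=01 partition.
SELECT * FROM sales WHERE `month` = '01';
```

Note that a month-only partition puts January of every year in the same partition; if that is not wanted, a year partition column in front of it would keep them apart.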

How to use insert statement for a Hive partitioned table?

I have a hive table dynpart.
id int
name char(30)
city char(30)
thisday string
# Partition Information
# col_name data_type comment
thisday string
It is partitioned by 'thisday' whose datatype is STRING.
How can I insert a single record into a particular partition of the table? I know there is a load command to load an entire file into a Hive table; I just want to know how an insert statement can be written for a partitioned table. I tried the command below, but it takes its data from another table.
insert into droplater partition(thisday='30/03/2017') select * from dynpart;
The table droplater has the same structure as dynpart. But the above command inserts data from another table. What I'd like to learn is how to write a simple insert command into a partition, like: insert into tabname values(1,"abcd","efgh");
This will work for primitive types only (no arrays, structs, etc.):
insert into tabname partition (thisday='30/03/2017') values (1,"abcd","efgh");
This will work for all types:
insert into tabname partition (thisday='30/03/2017') select 1,"abcd","efgh";
P.S.
By all means, partition your table by a date column (thisday date):
insert into tabname partition (thisday=date '2017-03-30') ...
or at least use the ISO date format:
insert into tabname partition (thisday='2017-03-30') ...
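As a further sketch, the partition value can also be supplied as the last value of each row instead of in the PARTITION clause, which turns the statement into a dynamic partition insert. This assumes a Hive version that supports INSERT ... VALUES and that nonstrict mode is acceptable; tabname and the row values are placeholders carried over from the answer above.

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The trailing value of each row is used as the thisday partition.
INSERT INTO TABLE tabname PARTITION (thisday)
VALUES (1, 'abcd', 'efgh', '2017-03-30'),
       (2, 'ijkl', 'mnop', '2017-03-31');

-- Check which partitions now exist.
SHOW PARTITIONS tabname;
```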

Map a hive partition to a location

I have a Hive external table partitioned by year, month, day and hour.
PARTITIONED BY (
`year` int,
`month` int,
`day` int,
`hour` int)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION
'hdfs://path/to/data'
The data exists in directories such as
2014/05/10/07/00
2014/05/10/07/01
...
2014/05/10/07/22
2014/05/10/07/23
I get results when I select data using the following:
Select * from my_table where year=2014 and month="05" and day="07" and hour="03"
but I want to be able to query without the quotes for values that start with a zero. Currently the following two examples don't work:
Select * from my_table where year=2014 and month=05 and day=07 and hour=03
Select * from my_table where year=2014 and month=5 and day=7 and hour=3
How can I support this without renaming the directories to drop the zero prefix on single-digit values?
Thanks,
Guy
Before I go into the answer: this does involve changing the directory names, but it will make querying much simpler for you.
We have a similar structure for our partitions, but instead of naming directories in the format 2014/05/10/07/22, we name them like 2014/201405/20140510/07/20140510.22. Basically the partitions are:
PARTITIONED BY
(
years bigint,
months bigint,
days bigint,
hours float
)
Now for the advantages of this layout.
Query mentioned in the question:
Select * from my_table where year=2014 and month=05 and day=07 and hour=03
With the new partitions:
Select * from my_table where hours = 20140507.03
Queries on days and months can likewise be run directly, without explicitly specifying the months and years.
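A different approach, not the one described in the answer above, is to keep the existing zero-padded directories and register each partition with an explicit location. This is only a sketch: it assumes partitions are added by hand or by a script, and that 2014/05/10/07 under the table's path is the directory holding one hour of data.

```sql
-- Map integer partition values to the zero-padded directory.
ALTER TABLE my_table ADD IF NOT EXISTS
  PARTITION (year = 2014, month = 5, day = 10, hour = 7)
  LOCATION 'hdfs://path/to/data/2014/05/10/07';

-- The partition values are now plain integers, so no quotes are needed.
SELECT *
FROM my_table
WHERE year = 2014 AND month = 5 AND day = 10 AND hour = 7;
```

The trade-off is one ALTER TABLE statement per directory, which would typically be generated by whatever job writes the data.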
