updating a table using hive - hadoop

Right now I run the following Hive query:
CREATE TABLE dwo_analysis.exp_shown AS
SELECT
MIN(sc.date_time) as first_shown_time,
SUBSTR(sc.post_evar12,1,24) as guid,
sc.post_evar238 as experiment_name,
sc.post_evar239 as variant_name
FROM test sc
WHERE report_suite='adbemmarvelweb.prod'
AND date >= DATE_SUB(CURRENT_DATE,90) AND date < DATE_SUB(CURRENT_DATE, 2)
AND post_prop5 = 'experiment:standard:authenticated:shown'
AND post_evar238 NOT LIKE 'control%'
AND post_evar238 <> ''
AND post_evar239 <> ''
GROUP BY SUBSTR(sc.post_evar12,1,24), sc.post_evar238, sc.post_evar239
The table test is large. I would like to optimize this by running the full query once and, on every later run, fetching only the last 2 days of data and adding it to the table.
So basically, run the above query once, and on each subsequent run use it with the condition
WHERE click_date >= DATE_SUB(CURRENT_DATE, 2) AND click_date < CURRENT_DATE
How do I update the table using Hive so that it is populated with the rows matching the condition above?

First, your queries would be quicker if the Hive table were partitioned by date. Your create table statement isn't inserting into any partitions, so I suspect your table is not partitioned. It would also be quicker if the source data were stored as Parquet or ORC.
In any case, you can overwrite the table for a date range like so:
INSERT OVERWRITE TABLE dwo_analysis.exp_shown
SELECT * FROM test
WHERE click_date
BETWEEN DATE_SUB(CURRENT_DATE, 2) AND CURRENT_DATE;
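A minimal sketch of an incremental refresh, under some assumptions: that exp_shown keeps the four columns from the question's CREATE TABLE, that click_date is the date column in test (the question uses both date and click_date), and that first_shown_time should stay the earliest time per guid, experiment, and variant. On an unpartitioned table, INSERT OVERWRITE replaces every existing row, and SELECT * returns the columns of test rather than those four, so this sketch merges the stored rows with the newly aggregated ones and re-aggregates:

```sql
-- Sketch only: rebuild exp_shown from its current rows plus the last 2 days of test.
INSERT OVERWRITE TABLE dwo_analysis.exp_shown
SELECT
  MIN(merged.first_shown_time) AS first_shown_time,  -- earliest sighting wins
  merged.guid,
  merged.experiment_name,
  merged.variant_name
FROM (
  -- rows already stored by the one-time full run
  SELECT first_shown_time, guid, experiment_name, variant_name
  FROM dwo_analysis.exp_shown
  UNION ALL
  -- newly aggregated rows from the recent window (click_date is an assumed column name)
  SELECT
    MIN(sc.date_time)           AS first_shown_time,
    SUBSTR(sc.post_evar12,1,24) AS guid,
    sc.post_evar238             AS experiment_name,
    sc.post_evar239             AS variant_name
  FROM test sc
  WHERE sc.report_suite = 'adbemmarvelweb.prod'
    AND sc.click_date >= DATE_SUB(CURRENT_DATE, 2)
    AND sc.click_date < CURRENT_DATE
    AND sc.post_prop5 = 'experiment:standard:authenticated:shown'
    AND sc.post_evar238 NOT LIKE 'control%'
    AND sc.post_evar238 <> ''
    AND sc.post_evar239 <> ''
  GROUP BY SUBSTR(sc.post_evar12,1,24), sc.post_evar238, sc.post_evar239
) merged
GROUP BY merged.guid, merged.experiment_name, merged.variant_name;
```

This sketch only ever adds or keeps rows, so it does not reproduce the rolling 90-day window of the original query.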

Related

How to make the max function in a Hive query ignore __HIVE_DEFAULT_PARTITION__

I have a view which uses max to show the latest partition (partitions have the format 2021-01, 2021-02, 2021-03, 2021-04). The Hive table also has a __HIVE_DEFAULT_PARTITION__ partition.
When we run the query in Impala, max on the partition column gives the correct value of 2021-04, ignoring __HIVE_DEFAULT_PARTITION__, but the same query does not work in Hive, where it returns __HIVE_DEFAULT_PARTITION__.
Is there any way to make the Hive query ignore the default partition, if it exists, when returning max on that column?
You can filter it:
select max(partition_col) from your_table where partition_col != "__HIVE_DEFAULT_PARTITION__"
If you do not need data in __HIVE_DEFAULT_PARTITION__, you can drop it:
ALTER TABLE your_table DROP PARTITION (partition_col='__HIVE_DEFAULT_PARTITION__');
Transforming __HIVE_DEFAULT_PARTITION__ to NULL can be a solution if, along with max(partition_col), you want to aggregate something else and do not want to exclude the __HIVE_DEFAULT_PARTITION__ partition:
select max(case when partition_col = "__HIVE_DEFAULT_PARTITION__" then NULL else partition_col end) as max_partition_col,
--aggregate something else including HIVE_DEFAULT_PARTITION
from your_table
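As a filled-in illustration of the template above, the following sketch pairs the NULL-mapped max with a row count over all partitions. COUNT(*) merely stands in for the "something else" and is an assumption; your_table and partition_col are the placeholder names used above. It relies on max skipping NULL values.

```sql
SELECT
  -- the default partition is mapped to NULL, which MAX skips
  MAX(CASE WHEN partition_col = '__HIVE_DEFAULT_PARTITION__' THEN NULL
           ELSE partition_col END) AS max_partition_col,
  -- still counts rows from every partition, including the default one
  COUNT(*) AS total_rows
FROM your_table;
```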

How to replace NULL values in one column to 0 (of a very large table) without creating a new column of the desired results added to the table in HIVE?

I am trying to replace all of the NULL values with 0 in a column of a big table in Hive.
However, every time I try, I end up generating a new column. The column I am trying to modify still exists and still has the NULL values, while the automatically generated new column (e.g. _c1) contains the values I want the original column to have.
I tried COALESCE, but that also generated a new column. I also tried CASE WHEN, with the same result.
Select *,
CASE WHEN columnname IS NULL THEN 0
ELSE columnname
END
from tablename;
I also tried:
SELECT coalesce(columnname, CAST(0 AS BIGINT)) FROM tablename
I would just like to update the table so that the other columns stay as they are and the column I want to modify keeps its original name, but with 0s in place of the NULL values.
I don't want to generate a new column, only modify the existing one.
How should I do that?
Use INSERT OVERWRITE:
insert overwrite table tablename
select c1,c2,...,coalesce(columnname,0) as columnname
from tablename
Note that you have to list all the other columns explicitly in the select, in the table's column order, because Hive matches inserted columns by position.

Performance of date time concatenation into timestamp

Oracle 12c, non-partitioned, no ASM.
This is the background. I have a table with multiple columns, 3 of them being:
TRAN_DATE DATE
TRAN_TIME TIMESTAMP(6)
FINAL_DATETIME NOT NULL TIMESTAMP(6)
The table has around 70 million records. What I want to do is concatenate the tran_date and the tran_time field and update the final_datetime field with that output, for all 70 million records.
This is the query I have -
update MYSCHEMA.MYTAB set FINAL_DATETIME = (to_timestamp( (to_char(tran_date, 'YYYY/MM/DD') || ' ' ||to_char(TRAN_TIME,'HH24MISS') ), 'YYYY-MM-DD HH24:MI:SS.FF'))
Eg:
At present (for one record)
TRAN_DATE=01-DEC-16
TRAN_TIME=01-JAN-70 12.35.58.000000 AM /*I need only the time part from this*/
FINAL_DATETIME=22-OCT-18 04.37.18.870000 PM
After the update, FINAL_DATETIME needs to be
01-DEC-16 12.35.58.000000 AM
to_timestamp requires building two character strings per row, and I fear this will slow down the update a lot. Any suggestions?
What more can I do to increase performance? No one else will be using the table at this point, so I do have the option to
Drop indices
Turn off logging
and anything more anyone can suggest.
Any help is appreciated.
I would prefer the CTAS (CREATE TABLE AS SELECT) method; your job would be simpler if you didn't have indexes, triggers and constraints on your table.
Create a new table containing the modified column:
CREATE TABLE mytab_new
NOLOGGING
AS
SELECT /*+ FULL(mytab) PARALLEL(mytab, 10) */ --hint to speed up the select.
CAST(tran_date AS TIMESTAMP) + ( tran_time - trunc(tran_time) ) AS final_datetime
FROM mytab;
I have included only one column (the final one) in the new table because storing the other two there would be a waste of resources. You may include other columns in the select apart from the two now-redundant ones.
Read up on LOGGING/NOLOGGING to understand the NOLOGGING option used in the CREATE TABLE statement.
The next step is to rebuild the indexes, triggers and constraints on the new table mytab_new, using the definitions from mytab for the other columns, if they exist.
Then rename the tables:
rename mytab to mytab_bkp;
rename mytab_new to mytab;
You may drop the table mytab_bkp after validating the new table or later when you feel you no longer need it.
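To sanity-check the expression used in the CTAS before running it on all 70 million rows, it can be tried on literals that mirror the example row from the question. This is only a sketch; the literal values are taken from that example, with 12.35.58 AM written as 00:35:58:

```sql
-- TRUNC on a TIMESTAMP returns a DATE at midnight, so the subtraction
-- leaves only the time-of-day part as an interval.
SELECT CAST(DATE '2016-12-01' AS TIMESTAMP)
       + (TIMESTAMP '1970-01-01 00:35:58' - TRUNC(TIMESTAMP '1970-01-01 00:35:58'))
       AS final_datetime
FROM dual;
```

The result should be 1 December 2016 at 00:35:58, displayed according to the session's NLS_TIMESTAMP_FORMAT.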

Hive cannot read ORC if set "orc.create.index"="false" when loading table

Hive version: 1.2.1. I create and load a table as follows:
CREATE TABLE ORC_NONE(
millisec bigint,
...
)
stored as orc tblproperties ("orc.create.index"="false");
insert into table ORC_NONE select * from ex_test_convert;
But when I query it, it always returns NULL. For example:
Select * from ORC_NONE limit 10; -- returns blank
Select min(millisec), max(millisec) from ORC_NONE; -- returns NULL, NULL
I checked the size of ORC_NONE: it is 2G, so the table is not empty. If I create the table with "orc.create.index"="true", the queries work.
I meant to test Hive performance on ORC with and without row indexes, more exactly, to test the skipping power of row indexes. However, it seems that Hive cannot read the data when the row index is unavailable.
Is this a bug? Or is something wrong with my loading?

Hive: Insert into hive table with column using select 1

Let's say I have a hive table test_entry with column called entry_id.
hive> desc test_entry;
OK
entry_id int
Time taken: 0.4 seconds, Fetched: 1 row(s)
Suppose I need to insert one row into the above table using select 1 (which returns 1), with syntax like the following:
hive> insert into table test_entry select 1;
But I get the below error:
FAILED: NullPointerException null
So effectively, I would like to insert one row whose entry_id is 1 using such a select statement (without referring to another table).
How can this be done?
Hive does not support what you're trying to do. Inserts into ORC-based tables were introduced in Hive 0.13.
Prior to that, you had to specify a FROM clause when doing INSERT .. SELECT.
A workaround might be to create an external table with one row and do the following:
INSERT .. SELECT 1 FROM table_with_one_row
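A sketch of that workaround, assuming some non-empty table already exists (some_existing_table is a hypothetical name); LIMIT 1 keeps the insert to a single row regardless of how many rows that table holds. On Hive 0.14 and later, INSERT ... VALUES is also available:

```sql
-- Workaround: borrow a FROM clause from any non-empty table.
INSERT INTO TABLE test_entry
SELECT 1 FROM some_existing_table LIMIT 1;

-- Hive 0.14+ alternative: insert the literal directly.
INSERT INTO TABLE test_entry VALUES (1);
```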
