Hive: Insert into hive table with column using select 1 - hadoop

Let's say I have a hive table test_entry with column called entry_id.
hive> desc test_entry;
OK
entry_id int
Time taken: 0.4 seconds, Fetched: 1 row(s)
Suppose I need to insert one row into this above table using select 1 (which returns 1). For example: A syntax which looks like the below:
hive> insert into table test_entry select 1;
But I get the below error:
FAILED: NullPointerException null
So effectively, I would like to insert one row for entry)id whose value will be 1 with such a select statement(without referring another table).
How can this be done?

Hive does not support what you're trying to do. Inserts to ORC based tables was introduced in Hive 0.13.
Prior to that, you have to specify a FROM clause if you're doing INSERT .. SELECT
A workaround might be to create an external table with one row and do the following
INSERT .. SELECT 1 FROM table_with_one_row

Related

Hive:- Select on partition column gives result even when table is truncated

I have a table in hive which is partitioned on a column.
I truncate the table and I do 2 selects on them.
Select count(*) from table, I get the result as 0. Which is expected.
But If I do a select on the partition column, I get results instead of null rows which I expected.
Select distinct <paritition column>
from table
Result:
partition value 1
partition value 2
......
I can see in hdfs that the partition folders still exists , though they are empty. I expected the metadata to get updated after I did the truncate.
I am not sure why I am getting the above result
Any help is appreciated
Thanks!

Hive select * shows 0 but count(1) show returns millions of rows

I create a hive external table and add a partition by
create external table tbA (
colA int,
colB string,
...)
PARTITIONED BY (
`day` string)
stored as parquet;
alter table tbA add partition(day= '2021-09-04');
Then I put a parquet file to the target HDFS directory by hdfs dfs -put ...
I can get expected results using select * from tbA in IMPALA.
In Hive, I can get correct result when using
select count(1) from tbA
However, when I use
select * from tbA limit 10
It returns no result at all.
If anything is wrong with the parquet file or the directory, IMPALA should not get the correct result and Hive can count out row numbers... Why select * from ... shows nothing? Any help is appreciated.
In addition,
running select distinct day from tbA, it returns 2021-09-04
running select * from tbA, it returns data with day = 2021-09-04
It seems this partition is not recognized correctly? I retried to drop the partition and use msck repair table but still not working...

updating a table using hive

Right now I run the following Hive query
CREATE TABLE dwo_analysis.exp_shown AS
SELECT
MIN(sc.date_time) as first_shown_time,
SUBSTR(sc.post_evar12,1,24) as guid,
sc.post_evar238 as experiment_name,
sc.post_evar239 as variant_name
FROM test
WHERE report_suite='adbemmarvelweb.prod'
AND date >= DATE_SUB(CURRENT_DATE,90) AND date < DATE_SUB(CURRENT_DATE, 2)
AND post_prop5 = 'experiment:standard:authenticated:shown'
AND post_evar238 NOT LIKE 'control%'
AND post_evar238 <> ''
AND post_evar239 <> ''
The table test is large. I would like to optimize this query by running it once, and every other time updating the table by getting the last 2 days of data and adding it to the table.
so basically run the above query once and every time run it again but with the condition
WHERE click_date >= DATE_SUB(CURRENT_DATE, 2) AND click_date < DATE_SUB(CURRENT_DATE)
How do I update the table using hive to populate the the rows as mentioned in the condition above?
First, your queries would be quicker if the Hive table were partitioned based on date. Your create table statement isn't inserting into any partitions, therefore I suspect your table is not partitioned. It would also be quicker if the source data were Parquet/ORC
In any case, you can overwrite the table for a date range like so
INSERT OVERWRITE TABLE dwo_analysis.exp_shown
SELECT * FROM test
WHERE click_date
BETWEEN DATE_SUB(CURRENT_DATE, 2) AND CURRENT_DATE;

Updating unique id column for newly added records in table in hive

I have a table in which I want unique identifier to be added automatically as a new record is inserted into it. Considering I have column for unique identifier already created.
hive can't update the table but you can create a temporary table or overwrite your first table.
you can also use concat function to join the two diferent column or string.
here is the examples
function :concat(string A, string B…)
return: string
hive> select concat(‘abc’,'def’,'gh’) from dual;
abcdefgh
HQL &result
insert overwrite table stock select tradedate,concat('aa',tradetime),stockid ,buyprice,buysize ,sellprice,sellsize from stock;
20130726 aa094251 204001 6.6 152000 6.605 100
20130726 aa094106 204001 6.45 13400 6.46 100

Creating Views in Hive with parameter

I have a table that contains rows belonging to various dates.
I want to CREATE A VIEW which should give me the data based on the date
CREATE VIEW newusers
AS
SELECT DISTINCT T1.uuid
FROM user_visit T1
WHERE T1.firstSeen="20140522";
I do not want to fix WHERE T1.firstSeen="20140522";
it can be any date like 20140525 etc.
Is there any way that I can create a view with date as parameter?
Not really sure if creating a view with such variable actually works. With Hive 1.2 an onwards, this is what happens when you create table.
hive> create view v_t1 as select * from t_t1 where d1="${hiveconf:v_val_dt}";
OK
Time taken: 6.222 seconds
hive> show create table v_t1;
OK
CREATE VIEW `v_t1` AS select `t_t1`.`f1`, `t_t1`.`d1` from `default`.`t_t1` where `t_t1`.`d1`="'2016-01-02'"
Time taken: 0.202 seconds, Fetched: 1 row(s)
When creating a view, it always takes the static constant value. The one thing that might work would be staying outside the prompt, something like this.
[hdfs#sandbox ~]$ hive -hiveconf v_val_dt=2016-01-01 -e 'select * from v_t1 where d1="${hiveconf:v_val_dt}";'
Logging initialized using configuration in file:/etc/hive/2.3.2.0-2950/0/hive-log4j.properties
OK
string_1 2016-01-01
Time taken: 7.967 seconds, Fetched: 1 row(s)
[hdfs#sandbox ~]$ hive -hiveconf v_val_dt=2016-01-06 -e 'select * from v_t1 where d1="${hiveconf:v_val_dt}";'
Logging initialized using configuration in file:/etc/hive/2.3.2.0-2950/0/hive-log4j.properties
OK
string_6 2016-01-06
Time taken: 10.967 seconds, Fetched: 1 row(s)
A different approach for this problem is creating a table in which you set key value pairs as parameters. In your view you can then reference this table.
create table load_params (key: string, value: string);
insert overwrite table load_params values ('firstSeen', '20140522');
Your view would then look like this:
create view newusers as
select distinct T1.uuid
from user_visit T1
where T1.firstSeen = (select cast(value as int) from load_params where key = 'firstSeen');
The load_params table can be edited before each run. As you would do when setting a different parameter with set.
In the hive script, just replace the date with a variable:
CREATE VIEW newusers
AS
SELECT DISTINCT T1.uuid
FROM user_visit T1
WHERE T1.firstSeen="${hiveconf:date}";
Then give that variable a value when invoking hive:
hive --hiveconf date=20140522 -f 'create_newusers_view.hql'
Or just set it from within hive:
set date=20140522;

Resources