Creating Views in Hive with parameter - hadoop

I have a table that contains rows belonging to various dates.
I want to CREATE A VIEW which should give me the data based on the date
CREATE VIEW newusers
AS
SELECT DISTINCT T1.uuid
FROM user_visit T1
WHERE T1.firstSeen="20140522";
I do not want to fix WHERE T1.firstSeen="20140522";
it can be any date like 20140525 etc.
Is there any way that I can create a view with date as parameter?

Not really sure if creating a view with such variable actually works. With Hive 1.2 an onwards, this is what happens when you create table.
hive> create view v_t1 as select * from t_t1 where d1="${hiveconf:v_val_dt}";
OK
Time taken: 6.222 seconds
hive> show create table v_t1;
OK
CREATE VIEW `v_t1` AS select `t_t1`.`f1`, `t_t1`.`d1` from `default`.`t_t1` where `t_t1`.`d1`="'2016-01-02'"
Time taken: 0.202 seconds, Fetched: 1 row(s)
When creating a view, it always takes the static constant value. The one thing that might work would be staying outside the prompt, something like this.
[hdfs#sandbox ~]$ hive -hiveconf v_val_dt=2016-01-01 -e 'select * from v_t1 where d1="${hiveconf:v_val_dt}";'
Logging initialized using configuration in file:/etc/hive/2.3.2.0-2950/0/hive-log4j.properties
OK
string_1 2016-01-01
Time taken: 7.967 seconds, Fetched: 1 row(s)
[hdfs#sandbox ~]$ hive -hiveconf v_val_dt=2016-01-06 -e 'select * from v_t1 where d1="${hiveconf:v_val_dt}";'
Logging initialized using configuration in file:/etc/hive/2.3.2.0-2950/0/hive-log4j.properties
OK
string_6 2016-01-06
Time taken: 10.967 seconds, Fetched: 1 row(s)

A different approach for this problem is creating a table in which you set key value pairs as parameters. In your view you can then reference this table.
create table load_params (key: string, value: string);
insert overwrite table load_params values ('firstSeen', '20140522');
Your view would then look like this:
create view newusers as
select distinct T1.uuid
from user_visit T1
where T1.firstSeen = (select cast(value as int) from load_params where key = 'firstSeen');
The load_params table can be edited before each run. As you would do when setting a different parameter with set.

In the hive script, just replace the date with a variable:
CREATE VIEW newusers
AS
SELECT DISTINCT T1.uuid
FROM user_visit T1
WHERE T1.firstSeen="${hiveconf:date}";
Then give that variable a value when invoking hive:
hive --hiveconf date=20140522 -f 'create_newusers_view.hql'
Or just set it from within hive:
set date=20140522;

Related

Hive select * shows 0 but count(1) show returns millions of rows

I create a hive external table and add a partition by
create external table tbA (
colA int,
colB string,
...)
PARTITIONED BY (
`day` string)
stored as parquet;
alter table tbA add partition(day= '2021-09-04');
Then I put a parquet file to the target HDFS directory by hdfs dfs -put ...
I can get expected results using select * from tbA in IMPALA.
In Hive, I can get correct result when using
select count(1) from tbA
However, when I use
select * from tbA limit 10
It returns no result at all.
If anything is wrong with the parquet file or the directory, IMPALA should not get the correct result and Hive can count out row numbers... Why select * from ... shows nothing? Any help is appreciated.
In addition,
running select distinct day from tbA, it returns 2021-09-04
running select * from tbA, it returns data with day = 2021-09-04
It seems this partition is not recognized correctly? I retried to drop the partition and use msck repair table but still not working...

updating a table using hive

Right now I run the following Hive query
CREATE TABLE dwo_analysis.exp_shown AS
SELECT
MIN(sc.date_time) as first_shown_time,
SUBSTR(sc.post_evar12,1,24) as guid,
sc.post_evar238 as experiment_name,
sc.post_evar239 as variant_name
FROM test
WHERE report_suite='adbemmarvelweb.prod'
AND date >= DATE_SUB(CURRENT_DATE,90) AND date < DATE_SUB(CURRENT_DATE, 2)
AND post_prop5 = 'experiment:standard:authenticated:shown'
AND post_evar238 NOT LIKE 'control%'
AND post_evar238 <> ''
AND post_evar239 <> ''
The table test is large. I would like to optimize this query by running it once, and every other time updating the table by getting the last 2 days of data and adding it to the table.
so basically run the above query once and every time run it again but with the condition
WHERE click_date >= DATE_SUB(CURRENT_DATE, 2) AND click_date < DATE_SUB(CURRENT_DATE)
How do I update the table using hive to populate the the rows as mentioned in the condition above?
First, your queries would be quicker if the Hive table were partitioned based on date. Your create table statement isn't inserting into any partitions, therefore I suspect your table is not partitioned. It would also be quicker if the source data were Parquet/ORC
In any case, you can overwrite the table for a date range like so
INSERT OVERWRITE TABLE dwo_analysis.exp_shown
SELECT * FROM test
WHERE click_date
BETWEEN DATE_SUB(CURRENT_DATE, 2) AND CURRENT_DATE;

Using spark/scala, I use saveAsTextFile() to HDFS, but hiveql("select count(*) from...) return 0

I created external table as follows...
hive -e "create external table temp_db.temp_table (a char(10), b int) PARTITIONED BY (PART_DATE VARCHAR(10)) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/work/temp_db/temp_table'"
And I use saveAsTextFile() with scala in IntelliJ IDEA as follows...
itemsRdd.map(_.makeTsv).saveAsTextFile("hdfs://work/temp_db/temp_table/2016/07/19")
So the file(fields terminated by '\t') was in the /work/temp_db/temp_table/2016/07/19.
hadoop fs -ls /work/temp_db/temp_table/2016/07/19/part-00000 <- data file..
But, I checked with hiveql, there are no datas as follows.
hive -e "select count(*) from temp_db.temp_table" -> 0.
hive -e "select * from temp_db.temp_table limit 5" -> 0 rows fetched.
Help me what to do. Thanks.
you are saving at wrong location from spark. Partition dir name follows part_col_name=part_value.
In Spark: save file at directory part_date=2016%2F07%2F19 under temp_table dir
itemsRdd.map(_.makeTsv)
.saveAsTextFile("hdfs://work/temp_db/temp_table/part_date=2016%2F07%2F19")
add partitions: You will need to add partition that should update hive table's metadata (partition dir we have created from spark as hive expected key=value format)
alter table temp_table add partition (PART_DATE='2016/07/19');
[cloudera#quickstart ~]$ hadoop fs -ls /user/hive/warehouse/temp_table/part*|awk '{print $NF}'
/user/hive/warehouse/temp_table/part_date=2016%2F07%2F19/part-00000
/user/hive/warehouse/temp_table/part_date=2016-07-19/part-00000
query partitioned data:
hive> alter table temp_table add partition (PART_DATE='2016/07/19');
OK
Time taken: 0.16 seconds
hive> select * from temp_table where PART_DATE='2016/07/19';
OK
test1 123 2016/07/19
Time taken: 0.219 seconds, Fetched: 1 row(s)
hive> select * from temp_table;
OK
test1 123 2016/07/19
test1 123 2016-07-19
Time taken: 0.199 seconds, Fetched: 2 row(s)
For Everyday process: you can run saprk job like this - just add partitions right after saveAsTextFile(), aslo note the s in alter statement. it is need to pass variable in hive sql from spark:
val format = new java.text.SimpleDateFormat("yyyy/MM/dd")
vat date = format.format(new java.util.Date())
itemsRDD.saveAsTextFile("/user/hive/warehouse/temp_table/part=$date")
val hive = new HiveContext(sc)
hive.sql(s"alter table temp_table add partition (PART_DATE='$date')")
NOTE: Add partition after saving the file or else spark will throw directory already exist exception as hive creates dir (if not exist) when adding partition.

Hive: Insert into hive table with column using select 1

Let's say I have a hive table test_entry with column called entry_id.
hive> desc test_entry;
OK
entry_id int
Time taken: 0.4 seconds, Fetched: 1 row(s)
Suppose I need to insert one row into this above table using select 1 (which returns 1). For example: A syntax which looks like the below:
hive> insert into table test_entry select 1;
But I get the below error:
FAILED: NullPointerException null
So effectively, I would like to insert one row for entry)id whose value will be 1 with such a select statement(without referring another table).
How can this be done?
Hive does not support what you're trying to do. Inserts to ORC based tables was introduced in Hive 0.13.
Prior to that, you have to specify a FROM clause if you're doing INSERT .. SELECT
A workaround might be to create an external table with one row and do the following
INSERT .. SELECT 1 FROM table_with_one_row

how to select data from hive with specific partition?

everyone. here are the interactions with the hive:
hive> show partitions TABLENAME
pt=2012.07.28.08/is_complete=1
pt=2012.07.28.09/is_complete=1
pt=2012.07.28.10/is_complete=1
pt=2012.07.28.11/is_complete=1
hive> select * from TABLENAME where pt='2012.07.28.10/is_complete=1' limit 1;
OK
Time taken: 2.807 seconds
hive> select * from TABLENAME where pt='2012.07.28.10' limit 1;
OK
61806fd3-5535-42a1-9ca5-91676d0e783f 1.160.243.215.1343401203879.1 2012-07-28 23:36:37
Time taken: 3.8 seconds
hive>
My question is that why the first select can't get the data?
"is_complete" is a column just like "pt" so the correct query is:
select * from TABLENAME where pt='2012.07.28.10' and is_complete='1' limit 1;
If you are using Ambari, you can query as below
select * from TABLE NAME WHERE PARTITION NAME and AND ANOTHER PARTITION NAME LIMIT 10
Here partitions are associated with table so we query directly considering them as table(Simple analogy). Here "/" symbol tells its another folder directory. Each partitioned table data will store in associated directory. For ex if we have partitions like below
year=2017/month=11/day=1/part=1
then we can use
select * from TABLE NAME where year=2017 AND month=11 AND day=1 AND part=1 LIMIT 10;

Resources