How to select data from Hive with a specific partition?

Hi everyone. Here are my interactions with Hive:
hive> show partitions TABLENAME;
pt=2012.07.28.08/is_complete=1
pt=2012.07.28.09/is_complete=1
pt=2012.07.28.10/is_complete=1
pt=2012.07.28.11/is_complete=1
hive> select * from TABLENAME where pt='2012.07.28.10/is_complete=1' limit 1;
OK
Time taken: 2.807 seconds
hive> select * from TABLENAME where pt='2012.07.28.10' limit 1;
OK
61806fd3-5535-42a1-9ca5-91676d0e783f 1.160.243.215.1343401203879.1 2012-07-28 23:36:37
Time taken: 3.8 seconds
hive>
My question is: why doesn't the first select return any data?

"is_complete" is a column just like "pt" so the correct query is:
select * from TABLENAME where pt='2012.07.28.10' and is_complete='1' limit 1;

If you are using Ambari, you can query as below:
select * from TABLE_NAME where PARTITION_COLUMN='value' AND ANOTHER_PARTITION_COLUMN='value' LIMIT 10;
Partition columns are associated with the table, so we query them directly as if they were ordinary columns (a simple analogy). The "/" symbol indicates another directory level: each partition's data is stored in its own directory. For example, if we have partitions like
year=2017/month=11/day=1/part=1
then we can use
select * from TABLE_NAME where year=2017 AND month=11 AND day=1 AND part=1 LIMIT 10;
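For context, a table with that partition layout might be declared like this (a minimal sketch; the web_logs table name and its msg column are hypothetical):
-- hypothetical table matching the year=/month=/day=/part= layout above
create table web_logs (msg string)
partitioned by (`year` int, `month` int, `day` int, part int);
-- each partition's rows land in their own directory, e.g.
-- .../web_logs/year=2017/month=11/day=1/part=1/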

Related

Hive select * shows 0 rows but count(1) returns millions of rows

I created a Hive external table and added a partition with:
create external table tbA (
colA int,
colB string,
...)
PARTITIONED BY (
`day` string)
stored as parquet;
alter table tbA add partition(day= '2021-09-04');
Then I put a parquet file into the target HDFS directory with hdfs dfs -put ...
I can get the expected results using select * from tbA in Impala.
In Hive, I get the correct result when using
select count(1) from tbA
However, when I use
select * from tbA limit 10
It returns no result at all.
If anything were wrong with the parquet file or the directory, Impala would not return the correct result either, and Hive can clearly count the rows... so why does select * from ... show nothing? Any help is appreciated.
In addition:
running select distinct day from tbA returns 2021-09-04
running select * from tbA returns data with day = 2021-09-04
It seems this partition is not recognized correctly. I tried dropping the partition and running msck repair table, but it still does not work...
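Not from the thread, but for reference: a partition can be tied to an explicit directory when it is added, which removes any ambiguity about where the parquet file must land, and msck repair table only discovers directories that follow Hive's day=... naming convention. A sketch, with a hypothetical location path:
-- point the partition at the exact directory the file was uploaded to
alter table tbA add partition (day='2021-09-04') location '/data/tbA/day=2021-09-04';
-- or let Hive rediscover partition directories named in the day=... style
msck repair table tbA;
show partitions tbA;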

Using Spark/Scala, I use saveAsTextFile() to HDFS, but hiveql("select count(*) from...") returns 0

I created an external table as follows:
hive -e "create external table temp_db.temp_table (a char(10), b int) PARTITIONED BY (PART_DATE VARCHAR(10)) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/work/temp_db/temp_table'"
And I use saveAsTextFile() with Scala in IntelliJ IDEA as follows:
itemsRdd.map(_.makeTsv).saveAsTextFile("hdfs://work/temp_db/temp_table/2016/07/19")
So the file (fields terminated by '\t') was in /work/temp_db/temp_table/2016/07/19:
hadoop fs -ls /work/temp_db/temp_table/2016/07/19/part-00000 <- data file..
But when I check with HiveQL, there is no data:
hive -e "select count(*) from temp_db.temp_table" -> 0.
hive -e "select * from temp_db.temp_table limit 5" -> 0 rows fetched.
Please help me figure out what to do. Thanks.
You are saving to the wrong location from Spark. A partition directory name must follow the pattern part_col_name=part_value.
In Spark: save the file in a directory named part_date=2016%2F07%2F19 under the temp_table dir (Hive escapes the "/" characters in the partition value as %2F):
itemsRdd.map(_.makeTsv)
.saveAsTextFile("hdfs://work/temp_db/temp_table/part_date=2016%2F07%2F19")
Add the partition: you will need to add the partition so that the Hive table's metadata is updated (the partition dir we created from Spark is already in Hive's expected key=value format):
alter table temp_table add partition (PART_DATE='2016/07/19');
[cloudera@quickstart ~]$ hadoop fs -ls /user/hive/warehouse/temp_table/part*|awk '{print $NF}'
/user/hive/warehouse/temp_table/part_date=2016%2F07%2F19/part-00000
/user/hive/warehouse/temp_table/part_date=2016-07-19/part-00000
Query the partitioned data:
hive> alter table temp_table add partition (PART_DATE='2016/07/19');
OK
Time taken: 0.16 seconds
hive> select * from temp_table where PART_DATE='2016/07/19';
OK
test1 123 2016/07/19
Time taken: 0.219 seconds, Fetched: 1 row(s)
hive> select * from temp_table;
OK
test1 123 2016/07/19
test1 123 2016-07-19
Time taken: 0.199 seconds, Fetched: 2 row(s)
For an everyday process: you can run the Spark job like this - just add the partition right after saveAsTextFile(). Also note the s prefix on the alter statement; it is needed to pass the variable into the Hive SQL from Spark:
import org.apache.spark.sql.hive.HiveContext

val format = new java.text.SimpleDateFormat("yyyy/MM/dd")
val date = format.format(new java.util.Date())
// the s prefix interpolates $date into the path and into the SQL below
itemsRdd.map(_.makeTsv).saveAsTextFile(s"/user/hive/warehouse/temp_table/part_date=$date")
val hive = new HiveContext(sc)
hive.sql(s"alter table temp_table add partition (PART_DATE='$date')")
NOTE: add the partition after saving the file, or else Spark will throw a directory-already-exists exception, because Hive creates the directory (if it does not exist) when the partition is added.

Hive: Insert into hive table with column using select 1

Let's say I have a hive table test_entry with column called entry_id.
hive> desc test_entry;
OK
entry_id int
Time taken: 0.4 seconds, Fetched: 1 row(s)
Suppose I need to insert one row into the above table using select 1 (which returns 1). For example, a syntax that looks like the below:
hive> insert into table test_entry select 1;
But I get the below error:
FAILED: NullPointerException null
So effectively, I would like to insert one row for entry_id whose value will be 1, with such a select statement (without referring to another table).
How can this be done?
Hive does not support what you're trying to do. Inserts into ORC-based tables were introduced in Hive 0.13.
Prior to that, you have to specify a FROM clause if you're doing INSERT .. SELECT.
A workaround might be to create an external table with one row and do the following
INSERT .. SELECT 1 FROM table_with_one_row
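A concrete sketch of that workaround (the one_row helper table and the file path are hypothetical):
-- helper table populated from a local file containing a single line
create table one_row (dummy string);
load data local inpath '/tmp/one_line.txt' into table one_row;
-- SELECT 1 now has a FROM clause to hang off
insert into table test_entry select 1 from one_row;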

Hive Timestamp aggregation

I have two Hive tables. One (table1) is updated on an hourly basis by the Java API team (they call the API and store the data into table1). Now I have to aggregate the latest data and store it into another table called table2 (only the newly loaded data, because the old data has already been aggregated and stored). For that I have used the query below:
set maxtime = select max(lastactivitytimestamp) from table2;
insert into table2 select * from table1 where lastactivitytimestamp > unix_timestamp('${hivevar:maxtime}');
I am not getting any results. But when I give the timestamp value manually, I get data, like below:
insert into table2 select * from table1 where lastactivitytimestamp > unix_timestamp('2014-08-18 15:23:26.754');
Is it possible to pass dynamic values in unix_timestamp?
Try removing the quotes from around the value in the unix_timestamp() function, like this:
insert into table2 select * from table1 where lastactivitytimestamp > unix_timestamp(${hivevar:maxtime});
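If variable substitution keeps failing, an alternative not given in the original answer is to compute the maximum inline, avoiding hivevar substitution entirely. A sketch with the same table and column names, keeping the question's unix_timestamp() conversion on the assumption that table2 stores string timestamps:
insert into table2
select t1.*
from table1 t1
-- single-row subquery supplying the current high-water mark
cross join (select max(lastactivitytimestamp) as maxts from table2) m
where t1.lastactivitytimestamp > unix_timestamp(m.maxts);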

Creating Views in Hive with parameter

I have a table that contains rows belonging to various dates.
I want to CREATE A VIEW that gives me the data based on the date:
CREATE VIEW newusers
AS
SELECT DISTINCT T1.uuid
FROM user_visit T1
WHERE T1.firstSeen="20140522";
I do not want to hard-code WHERE T1.firstSeen="20140522";
it can be any date, like 20140525, etc.
Is there any way that I can create a view with date as parameter?
Not really sure if creating a view with such a variable actually works. With Hive 1.2 and onwards, this is what happens when you create such a view:
hive> create view v_t1 as select * from t_t1 where d1="${hiveconf:v_val_dt}";
OK
Time taken: 6.222 seconds
hive> show create table v_t1;
OK
CREATE VIEW `v_t1` AS select `t_t1`.`f1`, `t_t1`.`d1` from `default`.`t_t1` where `t_t1`.`d1`="'2016-01-02'"
Time taken: 0.202 seconds, Fetched: 1 row(s)
When creating a view, it always bakes in the static constant value. The one thing that might work would be staying outside the prompt, something like this:
[hdfs@sandbox ~]$ hive -hiveconf v_val_dt=2016-01-01 -e 'select * from v_t1 where d1="${hiveconf:v_val_dt}";'
Logging initialized using configuration in file:/etc/hive/2.3.2.0-2950/0/hive-log4j.properties
OK
string_1 2016-01-01
Time taken: 7.967 seconds, Fetched: 1 row(s)
[hdfs@sandbox ~]$ hive -hiveconf v_val_dt=2016-01-06 -e 'select * from v_t1 where d1="${hiveconf:v_val_dt}";'
Logging initialized using configuration in file:/etc/hive/2.3.2.0-2950/0/hive-log4j.properties
OK
string_6 2016-01-06
Time taken: 10.967 seconds, Fetched: 1 row(s)
A different approach to this problem is to create a table holding key-value pairs as parameters, and then reference that table in your view.
create table load_params (key string, value string);
insert overwrite table load_params values ('firstSeen', '20140522');
Your view would then look like this:
create view newusers as
select distinct T1.uuid
from user_visit T1
where T1.firstSeen = (select cast(value as int) from load_params where key = 'firstSeen');
The load_params table can be edited before each run, just as you would set a different parameter with set.
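For example, to point the view at a different date before the next run (a sketch against the same load_params table; truncate-then-insert is used here because INSERT OVERWRITE ... VALUES is not accepted by every Hive version):
-- replace the previous parameter row with the new date
truncate table load_params;
insert into table load_params values ('firstSeen', '20140525');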
In the hive script, just replace the date with a variable:
CREATE VIEW newusers
AS
SELECT DISTINCT T1.uuid
FROM user_visit T1
WHERE T1.firstSeen="${hiveconf:date}";
Then give that variable a value when invoking hive:
hive --hiveconf date=20140522 -f 'create_newusers_view.hql'
Or just set it from within hive:
set date=20140522;
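Keep in mind, as the show create table output in the first answer demonstrates, that the variable is substituted once, at view-creation time, so the view has to be re-created whenever the date changes. A sketch from within the hive prompt:
-- substituted at creation time; re-create the view to pick up a new date
set date=20140525;
drop view if exists newusers;
CREATE VIEW newusers
AS
SELECT DISTINCT T1.uuid
FROM user_visit T1
WHERE T1.firstSeen="${hiveconf:date}";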
