Can External Tables in Hive Intelligently Identify Partitions?

I need to run this command whenever I need to mount a partition. Rather than doing it manually, is there a way to auto-detect partitions in external Hive tables?
ALTER TABLE TableName ADD IF NOT EXISTS PARTITION () LOCATION 'locationpath';

Recover Partitions (MSCK REPAIR TABLE)
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)
MSCK REPAIR TABLE table_name;
Partitions will be added automatically.
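For example, a minimal sketch reusing the profile table and location from the dynamic-partition example below (the city=Boston directory and boston.csv file are hypothetical):
$ hdfs dfs -mkdir -p /user/test/profile/city=Boston
$ hdfs dfs -put boston.csv /user/test/profile/city=Boston/
hive> MSCK REPAIR TABLE profile;
hive> SHOW PARTITIONS profile;
Note that the directory must follow the partition_column=value naming convention for MSCK to detect it.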

With dynamic partitioning, the directories do not need to be created manually. But dynamic partition mode needs to be set to nonstrict; by default it is strict.
CREATE EXTERNAL TABLE profile (
userId int
)
PARTITIONED BY (city string)
LOCATION '/user/test/profile';
set hive.exec.dynamic.partition.mode=nonstrict;
hive> insert into profile partition(city)
select * from nonpartition;
hive> select * from profile;
OK
1 Chicago
1 Chicago
2 Orlando
and in HDFS
[cloudera@quickstart ~]$ hdfs dfs -ls /user/test/profile
Found 2 items
drwxr-xr-x - cloudera supergroup 0 2016-08-26 22:40 /user/test/profile/city=Chicago
drwxr-xr-x - cloudera supergroup 0 2016-08-26 22:40 /user/test/profile/city=Orlando
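The insert above assumes a source table (here called nonpartition) whose definition is not shown in the answer and whose last selected column supplies the partition value. A minimal sketch consistent with the output might be:
hive> create table nonpartition (userId int, city string);
hive> insert into nonpartition values (1, 'Chicago'), (1, 'Chicago'), (2, 'Orlando');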

Related

How to find recently updated values in Hive without using a timestamp

I have a table like
id name sal
1 Saa 45000
2 aaa 33000
After the incremental load:
id name sal
3 bbb 55000
How can I get only the recently updated values without a timestamp?
The easiest and most efficient way is using a partition. You can have a partitioned table and create a new partition every time you do the incremental load. This way the latest partition will only have the latest records.
Please note that very frequent incremental loads can lead to a lot of small partitions, which may not be an optimal data design.
There can be a couple of other ways of doing this, but that purely depends on your use case, data rate, and volume.
Hope that helps!
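A minimal sketch of this approach (the table names emp_history and staging_emp and the load_batch partition column are hypothetical):
create table staging_emp (id int, name string, sal int);
create table emp_history (id int, name string, sal int)
partitioned by (load_batch string);
-- first load: staging_emp holds rows 1 and 2
insert into emp_history partition (load_batch='batch1')
select id, name, sal from staging_emp;
-- next incremental load: staging_emp is reloaded with only the new row (id 3)
insert into emp_history partition (load_batch='batch2')
select id, name, sal from staging_emp;
-- only the most recently loaded records
select id, name, sal from emp_history where load_batch='batch2';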
Create a table.
hive> create table student(id int, name string);
OK
Time taken: 3.503 seconds
Insert one record into the table.
hive> insert into student values(1, 'first');
hive> select * from student;
OK
1 first
Time taken: 0.109 seconds, Fetched: 1 row(s)
Use the command below in the Hive terminal to find the location of the table, i.e., the location recorded in the metastore for the student table.
hive> describe formatted student;
You should get the details as shown below.
# Detailed Table Information
Database: retaildb
Owner: root
CreateTime: Thu Mar 08 15:52:47 PST 2018
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://quickstart.cloudera:8020/user/hive/warehouse/retaildb.db/student
Now check the content:
[root@quickstart cloudera]# hdfs dfs -ls hdfs://quickstart.cloudera:8020/user/hive/warehouse/retaildb.db/student
Found 1 items
-rwxr-xr-x 1 root supergroup 8 2018-03-08 15:53 hdfs://quickstart.cloudera:8020/user/hive/warehouse/retaildb.db/student/000000_0
[root@quickstart cloudera]# hdfs dfs -cat hdfs://quickstart.cloudera:8020/user/hive/warehouse/retaildb.db/student/*
1first
Insert one more record.
hive> insert into student values(1, 'second');
hive> select * from student;
OK
1 first
1 second
Time taken: 0.095 seconds, Fetched: 2 row(s)
Check the table location again:
[root@quickstart cloudera]# hdfs dfs -ls hdfs://quickstart.cloudera:8020/user/hive/warehouse/retaildb.db/student/
Found 2 items
-rwxr-xr-x 1 root supergroup 8 2018-03-08 15:53 hdfs://quickstart.cloudera:8020/user/hive/warehouse/retaildb.db/student/000000_0
-rwxr-xr-x 1 root supergroup 9 2018-03-08 15:57 hdfs://quickstart.cloudera:8020/user/hive/warehouse/retaildb.db/student/000000_0_copy_1
[root@quickstart cloudera]# hdfs dfs -cat hdfs://quickstart.cloudera:8020/user/hive/warehouse/retaildb.db/student/000000_0_copy_1
1second
[root@quickstart cloudera]# hdfs dfs -cat hdfs://quickstart.cloudera:8020/user/hive/warehouse/retaildb.db/student/*
1first
1second

Why does querying an external hive table require write access to the hdfs directory?

I've hit an interesting permissions problem when setting up an external table to view some Avro files in Hive.
The Avro files are in this directory:
drwxr-xr-x - myserver hdfs 0 2017-01-03 16:29 /server/data/avrofiles/
The server can write to this directory, but regular users cannot.
As the database admin, I create an external table in Hive referencing this directory:
hive> create external table test_table (data string) stored as avro location '/server/data/avrofiles';
Now as a regular user I try to query the table:
hive> select * from test_table limit 10;
FAILED: HiveException java.security.AccessControlException: Permission denied: user=regular.joe, access=WRITE, inode="/server/data/avrofiles":myserver:hdfs:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
Weird. I'm only trying to read the contents of the files using Hive; I'm not trying to write to them.
Oddly, I don't get the same problem when I partition the table like this:
As database_admin:
hive> create external table test_table_partitioned (data string) partitioned by (value string) stored as avro;
OK
Time taken: 0.104 seconds
hive> alter table test_table_partitioned add if not exists partition (value='myvalue') location '/server/data/avrofiles';
OK
As a regular user:
hive> select * from test_table_partitioned where value = 'some_value' limit 10;
OK
Can anyone explain this?
One interesting thing I noticed is that the Location values for the two tables are different and have different permissions:
hive> describe formatted test_table;
Location: hdfs://server.companyname.com:8020/server/data/avrofiles
$ hadoop fs -ls /server/data/
drwxr-xr-x - myserver hdfs 0 2017-01-03 16:29 /server/data/avrofiles/
(regular users cannot write)
hive> describe formatted test_table_partitioned;
Location: hdfs://server.companyname.com:8020/apps/hive/warehouse/my-database.db/test_table_partitioned
$ hadoop fs -ls /apps/hive/warehouse/my-database.db/
drwxrwxrwx - database_admin hadoop 0 2017-01-04 14:04 /apps/hive/warehouse/my-database.db/test_table_partitioned
anyone can do anything :)

After Static Partitioning output is not as expected in hive

I am working with static partitioning.
The data for processing is as follows:
Id Name Salary Dept Doj
1,Murtaza,360000,Sales,2010
2,Soumya,478968,Admin,2011
3,Sneha,45789, Dev,2012
4,Asif ,145687, Qa,2012
5,Shreyashi,36598,Qa,2011
6,Adil,25987,Dev,2010
7,Yashwant,23982,Admin,2011
8,Mohsin,569875,2012
9,Anil,56798,Sales,2010
10,Balaji,56489,Sales,2012
11,Utsav,563895,Qa,2010
12,Anuj,546987,Dev,2010
The HQL for creating the partitioned table and loading data into it is as follows:
create external table if not exists murtaza.PartSalaryReport (ID int,Name
string,Salary string,Dept string)
partitioned by (Doj string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile
location '/user/cts573151/externaltables';
LOAD DATA LOCAL INPATH '/home/cts573151/partition.txt'
overwrite into table murtaza.PartSalaryReport partition (Doj=2010);
LOAD DATA LOCAL INPATH '/home/cts573151/partition.txt'
overwrite into table murtaza.PartSalaryReport partition (Doj=2011);
LOAD DATA LOCAL INPATH '/home/cts573151/partition.txt'
overwrite into table murtaza.PartSalaryReport partition (Doj=2012);
Select * from murtaza.PartSalaryReport;
Now the problem is that in the HDFS location where the external table is located, I should get the data directory-wise, and up to that point it is OK:
[cts573151@aster2 ~]$ hadoop dfs -ls /user/cts573151/externaltables
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
Found 4 items
drwxr-xr-x - cts573151 supergroup 0 2016-12-12 13:06 /user/cts573151/externaltables/doj=2010
drwxr-xr-x - cts573151 supergroup 0 2016-12-12 13:06 /user/cts573151/externaltables/doj=2011
drwxr-xr-x - cts573151 supergroup 0 2016-12-12 13:06 /user/cts573151/externaltables/doj=2012
But when I look into the data inside
drwxr-xr-x - cts573151 supergroup 0 2016-12-12 13:06 /user/cts573151/externaltables/doj=2010
it shows the data of all of 2010, 2011, and 2012, though it should show only the 2010 data.
[cts573151@aster2 ~]$ hadoop dfs -ls /user/cts573151/externaltables/doj=2010
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
Found 1 items
-rwxr-xr-x 3 cts573151 supergroup 270 2016-12-12 13:06 /user/cts573151/externaltables/doj=2010/partition.txt
[cts573151@aster2 ~]$ hadoop dfs -cat /user/cts573151/externaltables/doj=2010/partition.txt
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
1,Murtaza,360000,Sales,2010
2,Soumya,478968,Admin,2011
3,Sneha,45789,Dev,2012
4,Asif,145687,Qa,2012
5,Shreyashi,36598,Qa,2011
6,Adil,25987,Dev,2010
7,Yashwant,23982,Qa,2011
9,Anil,56798,Sales,2010
10,Balaji,56489,Sales,2012
11,Utsav,53895,Qa,2010
12,Anuj,54987,Dev,2010
[cts573151@aster2 ~]$
Where is it wrong?
Since you are creating an external table in Hive, you have to follow the below set of commands. Note that LOAD DATA with a static partition does not filter rows: it simply moves the whole file into the partition directory, which is why each of your partitions ended up containing all of the data. Instead, put a file containing only the matching rows into each partition directory:
create external table if not exists murtaza.PartSalaryReport (
ID int, Name string, Salary string, Dept string)
partitioned by (Doj string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile
location '/user/cts573151/externaltables';
alter table murtaza.PartSalaryReport add partition (Doj=2010);
hdfs dfs -put /home/cts573151/partition1.txt /user/cts573151/externaltables/Doj=2010/
alter table murtaza.PartSalaryReport add partition (Doj=2011);
hdfs dfs -put /home/cts573151/partition2.txt /user/cts573151/externaltables/Doj=2011/
alter table murtaza.PartSalaryReport add partition (Doj=2012);
hdfs dfs -put /home/cts573151/partition3.txt /user/cts573151/externaltables/Doj=2012/
These commands work for me. Hoping it helps you!
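Alternatively (a sketch, not part of the original answer), you can load the whole file once into an unpartitioned staging table and let a dynamic-partition insert split the rows by Doj, since LOAD DATA never filters rows by the partition value. The staging table name is hypothetical:
create table murtaza.StageSalaryReport (
ID int, Name string, Salary string, Dept string, Doj string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH '/home/cts573151/partition.txt'
overwrite into table murtaza.StageSalaryReport;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table murtaza.PartSalaryReport partition (Doj)
select ID, Name, Salary, Dept, Doj from murtaza.StageSalaryReport;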

Apache hive MSCK REPAIR TABLE new partition not added

I am new to Apache Hive. While working with external table partitions, if I add a new partition directly in HDFS, the new partition is not added after running MSCK REPAIR TABLE. Below is the code I tried:
-- creating external table
hive> create external table factory(name string, empid int, age int) partitioned by(region string)
> row format delimited fields terminated by ',';
--Detailed Table Information
Location: hdfs://localhost.localdomain:8020/user/hive/warehouse/factory
Table Type: EXTERNAL_TABLE
Table Parameters:
EXTERNAL TRUE
transient_lastDdlTime 1438579844
-- creating directory in HDFS to load data for table factory
[cloudera@localhost ~]$ hadoop fs -mkdir 'hdfs://localhost.localdomain:8020/user/hive/testing/testing1/factory1'
[cloudera@localhost ~]$ hadoop fs -mkdir 'hdfs://localhost.localdomain:8020/user/hive/testing/testing1/factory2'
-- Table data
cat factory1.txt
emp1,500,40
emp2,501,45
emp3,502,50
cat factory2.txt
EMP10,200,25
EMP11,201,27
EMP12,202,30
-- copying from local to HDFS
[cloudera@localhost ~]$ hadoop fs -copyFromLocal '/home/cloudera/factory1.txt' 'hdfs://localhost.localdomain:8020/user/hive/testing/testing1/factory1'
[cloudera@localhost ~]$ hadoop fs -copyFromLocal '/home/cloudera/factory2.txt' 'hdfs://localhost.localdomain:8020/user/hive/testing/testing1/factory2'
-- Altering table to update in the metastore
hive> alter table factory add partition(region='southregion') location '/user/hive/testing/testing1/factory2';
hive> alter table factory add partition(region='northregion') location '/user/hive/testing/testing1/factory1';
hive> select * from factory;
OK
emp1 500 40 northregion
emp2 501 45 northregion
emp3 502 50 northregion
EMP10 200 25 southregion
EMP11 201 27 southregion
EMP12 202 30 southregion
Now I created a new file, factory3.txt, to add as a new partition for the table factory:
cat factory3.txt
user1,100,25
user2,101,27
user3,102,30
-- creating the path and copying table data
[cloudera@localhost ~]$ hadoop fs -mkdir 'hdfs://localhost.localdomain:8020/user/hive/testing/testing1/factory3'
[cloudera@localhost ~]$ hadoop fs -copyFromLocal '/home/cloudera/factory3.txt' 'hdfs://localhost.localdomain:8020/user/hive/testing/testing1/factory3'
Now I executed the below query to update the metastore for the newly added partition:
MSCK REPAIR TABLE factory;
Now the table is not showing the new partition content of the factory3 file. Can I know where I am making a mistake while adding the partition for table factory?
Whereas, if I run the alter command, then it shows the new partition data:
hive> alter table factory add partition(region='eastregion') location '/user/hive/testing/testing1/factory3';
Can I know why the MSCK REPAIR TABLE command is not working?
For MSCK to work, the naming convention /partition_name=partition_value/ should be used. For example, in the root directory of the table:
# hadoop fs -ls /user/hive/root_of_table/*
/user/hive/root_of_table/day=20200101/data1.parq
/user/hive/root_of_table/day=20200101/data2.parq
/user/hive/root_of_table/day=20200102/data3.parq
/user/hive/root_of_table/day=20200102/data4.parq
When you run msck repair table <tablename>, the partitions day=20200101 and day=20200102 will be added automatically.
You have to put the data in a directory named 'region=eastregion' inside the table location directory:
$ hadoop fs -mkdir 'hdfs://localhost.localdomain:8020/user/hive/warehouse/factory/region=eastregion'
$ hadoop fs -copyFromLocal '/home/cloudera/factory3.txt' 'hdfs://localhost.localdomain:8020/user/hive/warehouse/factory/region=eastregion'
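Once the file is under a region=value directory inside the table location, MSCK should be able to find it (a short sketch continuing the commands above):
hive> msck repair table factory;
hive> show partitions factory;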

Unable to see created database in hive in specified location

I'm new to Hive.
I've created a database in Hive, and by default the database is created in the Hive warehouse. When I run -ls against the Hive warehouse, I'm able to see the created database practice.db.
Query Used to Create Database:
create database practice
COMMENT 'Holds all practice tables';
I've created another database in Hive. When I run the -ls command against the path where I created the database, I am unable to see practice_first.db.
Query Used to Create Database:
create database practice_first
COMMENT 'Holds all practice tables'
LOCATION '/somepath in hdfs here';
I even checked the Hive warehouse; practice_first.db is not there either.
When I run show databases, I'm able to see the practice_first database in the list of databases.
Any suggestion where Hive created practice_first.db?
Thanks in advance.
Just tested to make sure it should appear without any tables being added, and it does. Try using describe on the database after you create it to verify the path:
hive> create database test1 COMMENT 'testing' LOCATION '/user/testuser/test1.db';
OK
Time taken: 0.63 seconds
hive> describe database test1;
OK
test1 testing hdfs://nameserviceHA/user/testuser/test1.db testuser USER
Time taken: 0.183 seconds, Fetched: 1 row(s)
hive> exit;
$ hadoop fs -ls /user/testuser/ | grep test
drwxr-xr-x - testuser testgrp 0 2015-02-12 15:43 /user/testuser/test1.db
Doing the same with the default location will automatically name the directory with a .db extension. Also, it looks like the group in this case will be hive instead of the user's group:
hive> drop database test1;
OK
Time taken: 0.379 seconds
hive> create database test1 COMMENT 'testing';
OK
Time taken: 0.319 seconds
hive> describe database test1;
OK
test1 testing hdfs://nameserviceHA/user/hive/warehouse/test1.db testuser USER
Time taken: 0.263 seconds, Fetched: 1 row(s)
hive> exit;
$ hadoop fs -ls /user/hive/warehouse | grep test1
drwxrwxrwt - testuser hive 0 2015-02-12 15:53 /user/hive/warehouse/test1.db
It will be created at the path specified by LOCATION.
Check HIVE-1537.
You will be able to see the practice_first.db directory if:
1) you create the database in the default location, or
2) you explicitly name the directory practice_first.db in the LOCATION clause of the CREATE DATABASE statement, as sketched below.
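For example (a sketch; the HDFS path is hypothetical):
hive> create database practice_first COMMENT 'Holds all practice tables' LOCATION '/somepath/practice_first.db';
hive> describe database practice_first;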
