How to create a partitioned Hive table on dynamic HDFS directories

I am having difficulty getting Hive to discover partitions that are created in HDFS.
Here is the directory structure in HDFS:
warehouse/database/table_name/A
warehouse/database/table_name/B
warehouse/database/table_name/C
warehouse/database/table_name/D
A, B, C and D are values of the column type.
When I create a Hive table using the following syntax:
CREATE EXTERNAL TABLE IF NOT EXISTS
table_name(`name` string, `description` string)
PARTITIONED BY (`type` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'hdfs:///tmp/warehouse/database/table_name'
I am unable to see any records when I query the table.
But when I create directories in HDFS as below
warehouse/database/table_name/type=A
warehouse/database/table_name/type=B
warehouse/database/table_name/type=C
warehouse/database/table_name/type=D
it works, and the partitions are discovered when I check using show partitions table_name.
Is there some configuration in Hive to be able to detect dynamic directories as partitions?

Creating an external table on top of a directory is not enough; the partitions also need to be registered in the metastore. Automatic partition discovery was added in Hive 4.0.0. For earlier versions, use MSCK REPAIR TABLE:
MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS];
or its equivalent on EMR:
ALTER TABLE table_name RECOVER PARTITIONS;
When you create dynamic partitions using INSERT OVERWRITE, the partition metadata is created automatically and the partition folders are in the form key=value.
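For the layout in the question, where the directories are named A, B, C, D rather than type=A, MSCK REPAIR TABLE will not pick them up, because it looks for key=value directory names. One option is to register each directory explicitly with ALTER TABLE ... ADD PARTITION ... LOCATION. The Python sketch below only builds those statements as strings; the helper name add_partition_statements is hypothetical, the directory names are assumed to be already known (for example from a directory listing), and the statements would still have to be run through a Hive client.

```python
def add_partition_statements(table, partition_column, base_location, dir_names):
    """Build one ALTER TABLE ... ADD PARTITION statement per directory.

    Hypothetical helper: it only formats strings and never contacts Hive or HDFS.
    Assumes directory names are simple values without quotes or slashes.
    """
    base = base_location.rstrip("/")
    statements = []
    for name in sorted(dir_names):
        statements.append(
            f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION (`{partition_column}`='{name}') "
            f"LOCATION '{base}/{name}';"
        )
    return statements


if __name__ == "__main__":
    # Values taken from the question; the directory list is typed in by hand here.
    for stmt in add_partition_statements(
        "table_name",
        "type",
        "hdfs:///tmp/warehouse/database/table_name",
        ["A", "B", "C", "D"],
    ):
        print(stmt)
```

Pointing each partition at its existing directory leaves the files where they are, which may be preferable when another process owns the directory layout.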

Related

Create Impala Table from HDFS Directory with subdirectories

I have a directory, such as /user/name/folder.
Inside this directory, I have more sub-directories named dt=2020-06-01, dt=2020-06-02, dt=2020-06-03, etc.
These directories contain parquet files. They all have the same schema.
Is it possible to create an Impala table using /user/name/folder?
Each time I do, I get a Table with 0 records. Is there a way to tell Impala to pull the parquet files from all of the sub-directories?
One way to do that is to load the data with static partitioning, in which you define the partitions manually. With static partitioning, you create each partition with an ALTER TABLE … ADD PARTITION statement and then load the data into the partition.
CREATE TABLE customers_by_date
(cust_id STRING, name STRING)
PARTITIONED BY (dt STRING)
STORED AS PARQUET;
ALTER TABLE customers_by_date
ADD PARTITION (dt='2020-06-01')
LOCATION '/user/name/folder/dt=2020-06-01';
If the location is not specified, the partition directory is created under the table's location:
ALTER TABLE customers_by_date
ADD PARTITION (dt='2020-06-01');
You could also load the data with HDFS commands:
$ hdfs dfs -cp /user/name/folder/dt=2020-06-01 /user/directory_impala/table/partition
You could follow these links to the Cloudera documentation for further details:
Partitioning for Impala Tables
Impala Create table statement
Impala Alter table statement

Hive: Add partitions for existing folder structure

I have a folder structure in HDFS like the one below. However, no partitions were actually created on the table using ALTER TABLE ADD PARTITION commands, even though the folder structure was set up as if the table had partitions.
How can I automatically add all the partitions to the Hive table? (Hive 1.0, external table)
/user/frank/clicks.db
    /date=20190401
        /file0004.csv
    /date=20190402
        /file0009.csv
    /date=20190501
        /file0000.csv
        /file0001.csv
...etc
Use the MSCK REPAIR TABLE command:
MSCK [REPAIR] TABLE tablename;
or
ALTER TABLE tablename RECOVER PARTITIONS;
if you are running Hive on EMR.
Read more details about both commands here: RECOVER PARTITIONS

Hive query not reading partition field

I created a partitioned Hive table using the following query
CREATE EXTERNAL TABLE `customer`(
`cid` string COMMENT '',
`member` string COMMENT '',
`account` string COMMENT '')
PARTITIONED BY (update_period string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION
'hdfs://nameservice1/user/customer'
TBLPROPERTIES (
'avro.schema.url'='/user/schema/Customer.avsc')
I'm writing to the partitioned location using a MapReduce program. When I read the output files using avro tools, they show the correct data in JSON format. But when I use a Hive query to display the data, nothing is displayed. If I don't use the partition field during table creation, the values are displayed in Hive. What could be the reason for this? I specify the output location for the MapReduce program as "/user/customer/update_period=201811".
Do I need to add anything to the MapReduce program configuration to resolve this?
You need to run MSCK REPAIR TABLE once you have loaded a new partition into the HDFS location.
Why do we need to run the MSCK REPAIR TABLE statement after each ingestion?
Hive stores a list of partitions for each table in its metastore. When new partitions are added directly to HDFS, the metastore (and hence Hive) is not aware of them unless the user adds them in one of the following ways.
1. Add each partition to the table:
hive> alter table <db_name>.<table_name> add partition(`date`='<date_value>')
location '<hdfs_location_of the specific partition>';
(or)
2. Run the metastore check with the repair table option:
hive> Msck repair table <db_name>.<table_name>;
This adds metadata to the Hive metastore for partitions that do not yet have it. In other words, it adds any partitions that exist on HDFS but not in the metastore.
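The effect described above, adding whatever exists on HDFS but not in the metastore, amounts to a set difference. The Python sketch below is a conceptual model only, not Hive's implementation: the function names are hypothetical, the directory and partition lists are passed in as plain Python lists, and only single-level partitioning is handled.

```python
def parse_partition_dir(dir_name):
    """Split a Hive-style directory name such as 'update_period=201811'.

    Returns a (key, value) tuple, or None if the name is not in key=value form.
    """
    key, sep, value = dir_name.partition("=")
    if not sep or not key or not value:
        return None
    return key, value


def missing_partitions(hdfs_dirs, metastore_partitions, partition_column):
    """Return partition values that exist as directories but not in the metastore.

    Conceptual sketch: both inputs are ordinary lists supplied by the caller.
    """
    found = set()
    for name in hdfs_dirs:
        parsed = parse_partition_dir(name)
        # Ignore directories that do not belong to this partition column.
        if parsed is not None and parsed[0] == partition_column:
            found.add(parsed[1])
    return sorted(found - set(metastore_partitions))


if __name__ == "__main__":
    dirs = ["update_period=201811", "update_period=201812", "_tmp"]
    print(missing_partitions(dirs, ["201811"], "update_period"))
```

This also illustrates why a directory that is not named key=value is skipped: there is no partition value to read from its name.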

Hive: Does Hive support partitioning and bucketing while using external tables

When PARTITIONED BY or CLUSTERED BY is used while creating Hive tables,
Hive creates separate files corresponding to each partition or bucket. But is this still valid for external tables? My understanding is that data files belonging to external tables are not managed by Hive. So does Hive create additional files for each partition or bucket and move the corresponding data into these files?
Edit - Adding details.
A few extracts from "Hadoop: The Definitive Guide" - "Chapter 17: Hive"
CREATE TABLE logs (ts BIGINT, line STRING) PARTITIONED BY (dt STRING, country STRING);
When we load data into a partitioned table, the partition values are specified explicitly:
LOAD DATA LOCAL INPATH 'input/hive/partitions/file1' INTO TABLE logs PARTITION (dt='2001-01-01', country='GB');
At the filesystem level, partitions are simply nested sub directories of the table directory.
After loading a few more files into the logs table, the directory structure might look like this:
The above table was obviously a managed table, so hive had the ownership of data and created a directory structure for each partition as in the above tree structure.
In the case of an external table:
CREATE EXTERNAL TABLE logs (ts BIGINT, line STRING) PARTITIONED BY (dt STRING, country STRING);
Followed by same set of load operations -
LOAD DATA LOCAL INPATH 'input/hive/partitions/file1' INTO TABLE logs PARTITION (dt='2001-01-01', country='GB');
How will Hive handle these partitions? For external tables without partitions, Hive simply points to the data file and answers any query by parsing it. But when loading data into a partitioned external table, where are the partitions created?
Hopefully in the Hive warehouse? Can someone support or clarify this?
Suppose you partition on date, as this is a common thing to do. The partition column must not also appear in the regular column list, and `date` is quoted with backticks because it is a reserved word in newer Hive versions.
CREATE EXTERNAL TABLE mydatabase.mytable (
var1 double
, var2 INT
)
PARTITIONED BY (`date` String)
LOCATION '/user/location/wanted/';
Then add all your partitions:
ALTER TABLE mytable ADD PARTITION (`date` = '2017-07-27');
ALTER TABLE mytable ADD PARTITION (`date` = '2017-07-28');
And so on.
Finally, you can add your data in the proper location. You will have an external partitioned table.
There is an easy way to do this.
Create your External Hive table first.
CREATE EXTERNAL TABLE database.table (
id integer,
name string
)
PARTITIONED BY (country String)
LOCATION 'xxxx';
Next, run the MSCK command (metastore consistency check):
msck repair table database.table;
This command recovers all partitions that are available under the table's path and updates the metastore. If you now run a query against the table, data from all partitions will be retrieved.

Insert partitioned data into partitioned hive table

I have stored the data in HDFS using Pig MultiStorage, split by the column id.
So the data is stored as
/output/1/part-0000
/output/2/
/output/3/
Now I have created a partitioned table in hive and I want to load the data from /output folder into this partitioned table. Is there any way to achieve this?
First, create a temporary Hive table into which you load all the data from the Pig output.
Then load your actual partitioned Hive table from the temporary table.
Something like the following:
FROM emp_external temp INSERT OVERWRITE TABLE emp_partition PARTITION(country) SELECT temp.id,temp.name,temp.dept,temp.sal,temp.country;
Alternatively, you can explore HCatalog for this case.
I am not sure whether you want to insert the data in the output folder (created by Pig) into an existing table or load it into a new partitioned Hive table.
If you want to load the data into a new Hive table, you can create a new partitioned table pointing to the output folder.
If you want to load the data into an existing Hive table, you can either create a temp table as @Aman mentioned and do an insert into the destination table
or
just move/copy the files in HDFS from output/ to the Hive table location.
Hope this helps
Assign a Hive schema to the Pig output location, with id as the partition column, by creating an external table over it and registering each directory with ALTER TABLE ... ADD PARTITION. Both are then Hive tables, and you can use a WHERE clause on the partition column to move the data over.
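If you go the move/copy route instead, the Pig directories (/output/1, /output/2, ...) can be moved into key=value directories under the table location so that Hive can recognise them as partitions. The Python sketch below only builds the hdfs dfs -mv command lines; the helper name move_commands and the table location used in the example are assumptions, and it does not check that the files' columns match the table schema. The partitions still have to be registered afterwards, for example with MSCK REPAIR TABLE.

```python
def move_commands(source_dir, table_location, partition_column, dir_names):
    """Build 'hdfs dfs -mv' commands that move <source>/<value> to
    <table_location>/<partition_column>=<value>.

    Hypothetical helper: it returns command strings and runs nothing.
    """
    src = source_dir.rstrip("/")
    dst = table_location.rstrip("/")
    commands = []
    for name in sorted(dir_names):
        if "=" in name:
            continue  # already looks like a key=value directory; leave it alone
        commands.append(f"hdfs dfs -mv {src}/{name} {dst}/{partition_column}={name}")
    return commands


if __name__ == "__main__":
    # '/warehouse/emp_partition' is a placeholder table location.
    for cmd in move_commands("/output", "/warehouse/emp_partition", "id", ["1", "2", "3"]):
        print(cmd)
```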
