Hive: load multiple partitioned HDFS files into a table

I have some twice-partitioned files in HDFS with the following structure:
/user/hive/warehouse/datascience.db/simulations/datekey=20210506/coeff=0.5/data.parquet
/user/hive/warehouse/datascience.db/simulations/datekey=20210506/coeff=0.75/data.parquet
/user/hive/warehouse/datascience.db/simulations/datekey=20210506/coeff=1.0/data.parquet
/user/hive/warehouse/datascience.db/simulations/datekey=20210507/coeff=0.5/data.parquet
/user/hive/warehouse/datascience.db/simulations/datekey=20210507/coeff=0.75/data.parquet
/user/hive/warehouse/datascience.db/simulations/datekey=20210507/coeff=1.0/data.parquet
and would like to load these into a Hive table as elegantly as possible. I know the typical solution for something like this is to load all the data into a non-partitioned table first, then transfer it to the final table using dynamic partitioning.
However, my files don't contain the datekey and coeff values in the actual data; the values appear only in the directory path, since that's how the data is partitioned. So how would I keep track of these values when I load the files into the intermediate table?
One workaround would be to run a separate LOAD DATA INPATH query for each datekey and coeff value. This would not need the intermediate table, but would be cumbersome and probably not optimal.
Is there a better way to do this?

The typical solution is to build an external partitioned table on top of the HDFS directory:
create external table table_name (
column1 datatype,
column2 datatype,
...
columnN datatype
)
partitioned by (datekey int,
coeff float)
STORED AS PARQUET
LOCATION '/user/hive/warehouse/datascience.db/simulations';
After that, recover all partitions. This command scans the table location and creates the partitions in the Hive metadata:
MSCK REPAIR TABLE table_name;
Now you can query the table columns along with the partition columns and do whatever you want with them: use the table as is, or load it into another table using insert .. select .., etc:
select
column1,
column2,
...
columnN,
--partition columns
datekey,
coeff
from table_name
where datekey = 20210506
;
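To make the idea concrete, here is a minimal Python sketch of how partition values can be read off a directory path of the form key=value, which is the naming convention the external table relies on. It is only an illustration, not Hive's actual implementation; the function name parse_partition_path is hypothetical.

```python
from pathlib import PurePosixPath


def parse_partition_path(file_path: str, table_root: str) -> dict:
    """Return the key=value directory segments between table_root and the file."""
    relative = PurePosixPath(file_path).relative_to(table_root)
    partitions = {}
    # Every component except the last one (the data file) should be key=value.
    for segment in relative.parts[:-1]:
        key, sep, value = segment.partition("=")
        if not sep:
            raise ValueError(f"not a partition directory: {segment!r}")
        partitions[key] = value
    return partitions


root = "/user/hive/warehouse/datascience.db/simulations"
path = root + "/datekey=20210506/coeff=0.75/data.parquet"
print(parse_partition_path(path, root))
```

The values come back as strings; in Hive, the types declared in PARTITIONED BY determine how they are interpreted.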

Related

Hive: Does Hive support partitioning and bucketing while using external tables

When the PARTITIONED BY or CLUSTERED BY keywords are used while creating Hive tables, Hive creates separate files corresponding to each partition or bucket.
But is this still valid for external tables? My understanding is that the data files of external tables are not managed by Hive. So does Hive create additional files for each partition or bucket and move the corresponding data into them?
Edit - Adding details.
A few extracts from "Hadoop: The Definitive Guide", Chapter 17: Hive:
CREATE TABLE logs (ts BIGINT, line STRING) PARTITIONED BY (dt STRING, country STRING);
When we load data into a partitioned table, the partition values are specified explicitly:
LOAD DATA LOCAL INPATH 'input/hive/partitions/file1' INTO TABLE logs PARTITION (dt='2001-01-01', country='GB');
At the filesystem level, partitions are simply nested sub directories of the table directory.
After loading a few more files into the logs table, the directory structure might look like this:
The above table was obviously a managed table, so Hive had ownership of the data and created a directory structure for each partition, as in the tree structure above.
In the case of an external table:
CREATE EXTERNAL TABLE logs (ts BIGINT, line STRING) PARTITIONED BY (dt STRING, country STRING);
Followed by the same set of load operations:
LOAD DATA LOCAL INPATH 'input/hive/partitions/file1' INTO TABLE logs PARTITION (dt='2001-01-01', country='GB');
How will Hive handle these partitions? For external tables without partitions, Hive simply points to the data file and answers queries by parsing it. But when loading data into a partitioned external table, where are the partitions created?
Hopefully in the Hive warehouse? Can someone confirm or clarify this?
Suppose you partition on date, as this is a common thing to do. Note that the partition column must not also appear in the regular column list:
CREATE EXTERNAL TABLE mydatabase.mytable (
var1 double
, var2 INT
)
PARTITIONED BY (date String)
LOCATION '/user/location/wanted/';
Then add all your partitions:
ALTER TABLE mytable ADD PARTITION( date = '2017-07-27' );
ALTER TABLE mytable ADD PARTITION( date = '2017-07-28' );
And so on.
Finally, you can add your data in the proper location. You will have an external partitioned table.
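Writing one ALTER TABLE statement per day by hand gets tedious, so the statements could be generated. The following Python sketch builds them for a date range; the helper name add_partition_statements is hypothetical, and the column name is wrapped in backticks on the assumption that a word like date may be reserved in your Hive version.

```python
from datetime import date, timedelta


def add_partition_statements(table: str, column: str, start: date, end: date) -> list:
    """Build one ALTER TABLE ... ADD PARTITION statement per day, end inclusive."""
    statements = []
    day = start
    while day <= end:
        # Backticks keep a reserved word such as `date` usable as an identifier.
        statements.append(
            f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION (`{column}` = '{day.isoformat()}');"
        )
        day += timedelta(days=1)
    return statements


for statement in add_partition_statements(
    "mytable", "date", date(2017, 7, 27), date(2017, 7, 28)
):
    print(statement)
```

The printed statements can then be run through whichever Hive client you already use.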
There is an easy way to do this.
Create your External Hive table first.
CREATE EXTERNAL TABLE database.table (
id integer,
name string
)
PARTITIONED BY (country String)
LOCATION 'xxxx';
Next, run an MSCK command (metastore consistency check):
msck repair table database.table;
This command will recover all partitions that are available in your path and update the metastore. Now, if you run your query against your table, data from all partitions will be retrieved.

Spark write data into partitioned Hive table very slow

I want to store a Spark DataFrame in a Hive table in normal, readable text format. To do so I first ran
sqlContext.sql("SET spark.sql.hive.convertMetastoreParquet=false")
My DataFrame is like:
final_data1_df = sqlContext.sql("select a, b from final_data")
and I am trying to write it by:
final_data1_df.write.partitionBy("b").mode("overwrite").saveAsTable("eefe_lstr3.final_data1")
but this is very slow, even slower than a Hive table write. To resolve this, I thought I would define the partitioning through a Hive DDL statement and then load the data like:
sqlContext.sql("""
CREATE TABLE IF NOT EXISTS eefe_lstr3.final_data1(
a BIGINT
)
PARTITIONED BY (b INT)
"""
)
sqlContext.sql("""
INSERT OVERWRITE TABLE eefe_lstr3.final_data1 PARTITION (b)
select * from final_data1""")
but this gives a partitioned Hive table whose data is still Parquet-formatted. Am I missing something here?
When you create the table explicitly then that DDL defines the table.
Normally text file is the default in Hive but it could have been changed in your environment.
Add "STORED AS TEXTFILE" at the end of the CREATE statement to make sure the table is plain text.

How to alter Hive partition column name

I need to change a partition column name (not the partition spec). I looked for the commands in the Hive wiki and on some pages found through Google, but I can only find options for altering the partition spec.
For example:
In /table/country='US' I can change US to USA, but I want to change country to continent.
I feel like the only option for changing a partition column name is dropping and re-creating the table. Is there any other option available? Please help.
Thanks in advance.
You can change the column name in the metadata as follows:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ChangeColumnName/Type/Position/Comment
But as the document says, this only changes the metadata. Hive partitions are implemented as directories with the naming pattern columnName=spec, so you also need to rename those directories on HDFS using the "hadoop fs" command.
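The directory renames could be scripted rather than typed one by one. This Python sketch only builds the hadoop fs -mv command lines for a list of partition values; the helper name rename_partition_dirs and the example table root are hypothetical, and it covers only the HDFS side, not the metadata change.

```python
import shlex


def rename_partition_dirs(table_root, values, old_column, new_column):
    """Build one `hadoop fs -mv` command per partition directory to rename."""
    commands = []
    for value in values:
        src = f"{table_root}/{old_column}={value}"
        dst = f"{table_root}/{new_column}={value}"
        # shlex.quote protects paths that contain shell-special characters.
        commands.append(f"hadoop fs -mv {shlex.quote(src)} {shlex.quote(dst)}")
    return commands


for command in rename_partition_dirs("/table", ["US", "MX"], "country", "continent"):
    print(command)
```

Printing the commands first, instead of executing them directly, gives you a chance to review them before anything is moved.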
You can alter the partition column using a simple swap method.
Create a new temp table with the same schema as the current table.
Move all files from the old table to the newly created table's location.
hadoop fs -mv <current_table_name> <temp_table_name>
Alter the schema of the original table (rename or drop the partitions).
Copy or load the temp table data back into the original table with the appropriate partition values.
hadoop fs -mv <temp_table_name> <current_table_name>
Run msck repair on the original table and drop the temp table.
NOTE: the mv command moves files from one location to another without copying them, which saves the copy time. Alternatively, LOAD DATA INPATH can be used to move the data into the original table.
You cannot change the partition column in Hive; in fact, Hive does not support altering partitioning columns.
You can think of it this way: Hive stores the data by creating a folder in HDFS for each partition column value. Altering the partition column would mean changing the whole directory structure and data of the Hive table, which is not possible. For example, if you have partitioned on year, this is how the directory structure looks:
tab1/clientdata/2009/file2
tab1/clientdata/2010/file3
If you want to change the partition column, you can perform the steps below.
Create another Hive table with the required changes in the partition column:
Create table new_table ( A int, ..... ) partitioned by ( B String )
Load the data from the previous table:
Insert into new_table partition ( B ) select A, B from Prev_table
As you said, renaming the value of a partition is very straightforward:
hive> ALTER TABLE test.usage PARTITION (country ='US') RENAME TO PARTITION (country='USA');
I know that this is not what you are looking for. Unfortunately, given that your data is already partitioned by country, the only option you have is to drop the table, remove the data (supposing your table is external) from the HDFS and reinsert the data using continent as partition.
What I would do in your case is use multiple partition levels, so that your folder structure looks like this:
/path/to/the/data/continent=america/country=usa
/path/to/the/data/continent=america/country=mexico
/path/to/the/data/continent=europe/country=spain
/path/to/the/data/continent=europe/country=italy
...
That way you can query the data for different levels of granularity (in this case continent and country).
Adding a solution here for later reference.
Use case: change the type of a partition column from STRING to INT
set hive.mapred.mode=nonstrict;
alter table {table_name} partition column ({column_name} {column_type});
e.g. ALTER TABLE employee PARTITION COLUMN (dept INT);

Mapping HDFS directory with .tsv files to Hive

I have data in HDFS in .tsv format. I need to load it into a Hive table and need some help.
The data in HDFS looks like:
/ad_data/raw/reg_logs/utc_date=2014-06-11/utc_hour=03
Note: Data is loaded into HDFS directory /ad_data/raw/reg_logs daily and hourly.
There are 3 .tsv files in this HDFS directory:
funel1.tsv
funel2.tsv
funel3.tsv
Each .tsv file has 3 columns separated by tab and has data like:
2344 -39 223
2344 -23 443
2394 -43 98
2377 -12 33
...
...
I want to create a Hive schema with 3 columns (id int, region_code int and count int), exactly as in HDFS. If possible, I would like to remove the negative sign in the Hive table, but that's not a big deal.
I created a Hive table with this schema (please correct me if I am wrong):
CREATE EXTERNAL TABLE IF NOT EXISTS reg_logs (
id int,
region_code int,
count int
)
PARTITIONED BY (utc_date STRING, utc_hour STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/ad_data/raw/reg_logs';
All I want to do is copy data from HDFS to Hive. I do not want to use "load data inpath '..' into table reg_logs" because I do not want to load the data manually every day. I just want to point the Hive table to the HDFS directory so that it picks up each day's data automatically.
How can I achieve this? Please correct my Hive table schema if needed, as well as the way to get the data there.
==
2nd part:
I want to create another table, reg_logs_org, populated from reg_logs. It should contain everything from reg_logs except the hour column.
The schema I created is:
CREATE EXTERNAL TABLE IF NOT EXISTS reg_logs_org (
id int,
region_code int,
count int
)
PARTITIONED BY (utc_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/ad_data/reg_logs_org';
Insert data into reg_logs_org from reg_logs:
insert overwrite table reg_logs_org
select id, region_code, sum(count), utc_date
from
reg_logs
group by
utc_date, id, region_code
error message:
FAILED: SemanticException 1:23 Need to specify partition columns because the destination table is partitioned. Error encountered near token 'reg_logs_org'
==
Thank you,
Rio
You're very close. The last step is that you need to add the partition information to Hive's metastore. Hive stores the location of every partition individually, and it does not automatically find new partitions. There are two ways to add the partitions:
Every hour, do an add partition statement:
alter table reg_logs add partition(utc_date='2014-06-11', utc_hour='03')
location '/ad_data/raw/reg_logs/utc_date=2014-06-11/utc_hour=03';
Every hour (or less frequently) do a table repair. This scans the root table location for any partitions it has not yet added.
msck repair table reg_logs;
The first approach is a bit more painful, but more efficient. The second approach is easy, but does a full scan of all partitions every time.
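For the first approach, the hourly statement could be generated from a timestamp by a scheduled job. Below is a minimal Python sketch; the function name hourly_add_partition is hypothetical, and it assumes the directory layout shown in the question (utc_date=YYYY-MM-DD/utc_hour=HH).

```python
from datetime import datetime


def hourly_add_partition(table: str, root: str, ts: datetime) -> str:
    """Build the add-partition statement for the hour that contains ts."""
    utc_date = ts.strftime("%Y-%m-%d")
    utc_hour = ts.strftime("%H")  # zero-padded, e.g. "03"
    location = f"{root}/utc_date={utc_date}/utc_hour={utc_hour}"
    return (
        f"alter table {table} add if not exists "
        f"partition(utc_date='{utc_date}', utc_hour='{utc_hour}') "
        f"location '{location}';"
    )


print(hourly_add_partition("reg_logs", "/ad_data/raw/reg_logs", datetime(2014, 6, 11, 3)))
```

The if not exists clause makes the statement safe to re-run if the job fires twice for the same hour.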
Edit: second half of question:
You just need to add some syntax for inserting into a table using dynamic partitions. In general, it is:
insert overwrite [table] partition([partition column])
select ...
Or in your case:
insert overwrite table reg_logs_org partition(utc_date)
select id, region_code, sum(count), utc_date
from
reg_logs
group by
utc_date, id, region_code

Is it possible to have multiple hive tables represented within the same HDFS directory structure?

Is it possible to have multiple hive tables represented within the same HDFS directory structure? In other words, is there a way to have multiple hive tables pointing to same/overlapping HDFS paths?
Here is my situation:
I have a table named "mytable", located in hdfs:/tables/mytable
CREATE EXTERNAL TABLE mytable
(
id int,
...
[a whole bunch of columns]
...
)
PARTITIONED BY (logname STRING)
STORED AS [I-do-not-know-what-just-yet]
LOCATION 'hdfs:/tables/mytable';
So, HDFS will look like:
hdfs:/tables/mytable/logname=tarzan/....
hdfs:/tables/mytable/logname=jane/....
hdfs:/tables/mytable/logname=whoa/....
Is it possible to have a hive table, named "tarzan", located in hdfs:/tables/mytable/logname=tarzan ? Same with hive table "jane", located in hdfs:/tables/mytable/logname=jane, etc.
The tarzan, jane, whoa, etc sub-tables share some columns (timestamp, ip_address, country, user_id, and some others), but there will also be a lot of columns that they do not have in common.
Is there a way to store this data once in HDFS, and use it for multiple tables as I described above? Furthermore, is there a way to store the data in an efficient way, since many of the tables will have columns that are not in common? Would a file format like RCFILE or PARQUET work in this case?
Thanks so much for any hints or help anyone can provide,
Yes, we can have multiple hive tables with the same underlying HDFS directory.
Example:
Create table emp and load data file file3 into it.
create table emp (id int, name string, salary int)
row format delimited
fields terminated by ',';
-- default location would be used
load data
local inpath '/home/parv/testfiles/file3'
into table emp;
Create another table, mirror, whose location points at the emp table's directory. When you select data from the mirror table, it is the same as the emp table's data (the contents of file3).
create table mirror (id int, name string, salary int)
row format delimited
fields terminated by ','
location 'hdfs:///user/hive/warehouse/parv.db/emp';
Load data into the mirror table. When you select data from either the mirror table or the emp table, it returns the same results (the contents of file3 and file4).
load data
local inpath '/home/parv/testfiles/file4'
into table mirror;
Conclusion:
The same data files are shared by both tables, emp and mirror.
HDFS shows a data directory only for the emp table and not for the mirror table, which is expected because mirror's LOCATION is the emp directory. Both tables are present in Hive and can be queried.
Answering my own question:
It IS possible to have multiple Hive tables represented by the same HDFS directory structure. But here is what I am looking to do:
A mytable table partitioned by logname (logname=tarzan, logname=jane, etc.)
A separate table for each logname: a "tarzan" table with only the columns used by tarzan and not by any other logname, the same for the "jane" table, etc.
Represent the data only one time in HDFS
For that, a better solution is to have the one mytable table, partitioned by logname, AND create a view for each logname with only the subset of columns it needs.
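As a rough illustration of the view-per-logname idea, this Python sketch generates one CREATE VIEW statement per logname. The helper name create_view_statement is hypothetical, and the column lists are placeholders built from the shared columns mentioned in the question; the real lists would come from your schema. Identifiers are backtick-quoted because a name like timestamp may be reserved.

```python
def create_view_statement(view, base_table, partition_column, columns):
    """Build a CREATE VIEW statement that selects one partition's columns."""
    column_list = ", ".join(f"`{column}`" for column in columns)
    return (
        f"CREATE VIEW IF NOT EXISTS `{view}` AS "
        f"SELECT {column_list} FROM `{base_table}` "
        f"WHERE `{partition_column}` = '{view}';"
    )


# Placeholder column subsets; only the shared column names come from the question.
columns_by_logname = {
    "tarzan": ["timestamp", "ip_address", "user_id"],
    "jane": ["timestamp", "country", "user_id"],
}

for logname, columns in columns_by_logname.items():
    print(create_view_statement(logname, "mytable", "logname", columns))
```

Because each view filters on the partition column, queries against it should only need to read that logname's directory.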
Yes, you could point multiple tables to the same location on HDFS. However, Hive doesn't support dynamic columns.
Is there a reason you can't just have 3 different tables? This would allow you to have different schemas (columns) for each.
--Brandon
