User ALTER TABLE ... CONCATENATE with partial matching partitions in Hive - hadoop

I want to use the ALTER TABLE ... CONCATENATE functionality in Hive, but it seems I have to give exact partition name. For example I have a table with two partition columns, date and group. I'd like to be able to do something like this:
alter table mytable partition (insert_date='2017-04-11',group='%') CONCATENATE;
But i can't find the way of doing it.

Concatenate doesn't support this.

Related

How to create table in Hive with specific column values from another table

I am new to Hive and have some problems. I try to find a answer here and other sites but with no luck... I also tried many different querys that come to my mind, also without success.
I have my source table and i want to create new table like this.
Were:
id would be number of distinct counties as auto increment numbers and primary key
counties as distinct names of counties (from source table)
You could follow this approach.
A CTAS(Create Table As Select)
with your example this CTAS could work
CREATE TABLE t_county
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE AS
WITH t AS(
SELECT DISTINCT county, ROW_NUMBER() OVER() AS id
FROM counties)
SELECT id, county
FROM t;
You cannot have primary key or foreign keys on Hive as you have primary key on RBDMSs like Oracle or MySql because Hive is schema on read instead of schema on write like Oracle so you cannot implement constraints of any kind on Hive.
I can not give you the exact answer because of it suppose to you must try to do it by yourself and then if you have a problem or a doubt come here and tell us. But, what i can tell you is that you can use the insertstatement to create a new table using data from another table, I.E:
create table CARS (name string);
insert table CARS select x, y from TABLE_2;
You can also use the overwrite statement if you desire to delete all the existing data that you have inside that table (CARS).
So, the operation will be
CREATE TABLE ==> INSERT OPERATION (OVERWRITE?) + QUERY OPERATION
Hive is not an RDBMS database, so there is no concept of primary key or foreign key.
But you can add auto increment column in Hive. Please try as:
Create table new_table as
select reflect("java.util.UUID", "randomUUID") id, countries from my_source_table;

How to drop hive column?

I have two columns Id and Name in Hive table, and I want to delete the Name column. I have used following command:
ALTER TABLE TableName REPLACE COLUMNS(id string);
The result was that the Name column values were assigned to the Id column.
How can I drop a specific column of the table and is there any other command in Hive to achieve my goal?
In addition to the existing answers to the question : Alter hive table add or drop column
As per Hive documentation,
REPLACE COLUMNS removes all existing columns and adds the new set of columns.
REPLACE COLUMNS can also be used to drop columns. For example, ALTER TABLE test_change REPLACE COLUMNS (a int, b int); will remove column c from test_change's schema.
The query you are using is right. But this will modify only schema i.e, the metastore. This will not modify anything on data side.
So, before you are dropping the column you should make sure that you hav correct data file.
In your case the data file should not contain name values.
If you don't want to modify the file then create another table with only specific column that you need.
Create table tablename as select id from already_existing_table
let me know if this helps.

HIVE: How create a table with all columns in another table EXCEPT one of them?

When I need to change a column into a partition (convert normal column as partition column in hive), I want to create a new table to copy all columns except one. I currently have >50 columns in the original table. Is there any clean way of doing that?
Something like:
CREATE student_copy LIKE student EXCEPT age and hair_color;
Thanks!
You can use a regex:
CTAS using REGEX column spec. :
set hive.support.quoted.identifiers=none;
CREATE TABLE student_copy AS SELECT `(age|hair_color)?+.+` FROM student;
set hive.support.quoted.identifiers=column;
BUT (as mentioned by Kishore Kumar Suthar :
this will not create a partitioned table, as that is not supported with CTAS (Create Table As Select).
Only way I see for you to get your partitioned table is by getting the complete create statement of the table (as mentioned by Abraham):
SHOW CREATE TABLE student;
Altering it to create a partition on the column you want. And after that you can use the select with regex when inserting into the new table.
If your partition column is already part of this select, then you need to make sure it is the last column you insert. If it is not you can exclude that column in the regex and including it as last. Also if you expect several partitions to be created based on your insert statement you need to enable 'dynamic partitioning':
set hive.support.quoted.identifiers=none;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE student_copy PARTITION(partcol1) SELECT `(age|hair_color|partcol1)?+.+`, partcol1 FROM student;
set hive.support.quoted.identifiers=column;
the 'hive.support.quoted.identifiers=none' is required to use the backticks '`' in the regex part of the query. I set this parameter to it's original value after my statement: 'hive.support.quoted.identifiers=column'
CREATE TABLE student_copy LIKE student;
It just copies the source table definition.
CREATE TABLE student_copy AS select name, age, class from student;
Target cannot be partitioned table.
Target cannot be external table.
It copies the structure as well as the data
I use below command to get the create statement of existing table.
SHOW CREATE TABLE student;
Copy the result and modify that based on your requirement for new table and run the modified command to get the new table.

How to alter Hive partition column name

I have to change the partition column name (not partition spec), I looked for the commands in hive wiki and some google pages. I can find the options for altering the partition spec,
i.e. For example
In /table/country='US' I can change US to USA, but I want to change country to continent.
I feel like the only option available for changing partition column name is dropping and re-creating the table. Is there is any other option available please help me.
Thanks in advance.
You can change column name in metadata by following:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ChangeColumnName/Type/Position/Comment
But as the document says, it only changes the metadata. Hive partitions are implemented as directories with the naming pattern columnName=spec. So you also need to change the names of those directories on HDFS by using "hadoop fs" command.
You have alter the partition column using simple swap method.
Create a new temp table which is same schema as current table.
Move all files in the old table to newly create table location.
hadoop fs -mv <current_table_name> <temp_table_name>
Alter the schema of the original table (Rename or drop the partitions)
Recopy/load the temp table data to the original table with appropriate partition values.
hadoop fs -mv <temp_table_name> <current_table_name>
msck repair the the original table & drop the temp_table.
NOTE : mv command move the file from one location to another with reducing the copy time. alternately we can use LOAD DATA INPATH for copy the data to the original table.
You can not change the partition column in hive infact Hive does not support alterting of partitioning columns
You can think of it this way - Hive stores the data by creating a folder in hdfs with partition column values - Since if you trying to alter the hive partition it means you are trying to change the whole directory structure and data of hive table which is not possible exp if you have partitioned on year this is how directory structure looks like
tab1/clientdata/**2009**/file2
tab1/clientdata/**2010**/file3
If you want to change the partition column you can perform below steps
Create another hive table with required changes in partition column
Create table new_table ( A int, B String.....)
Load data from previous table
Insert into new_table partition ( B ) select A,B from table Prev_table
As you said, rename the value for of the partition is very straightforward:
hive> ALTER TABLE test.usage PARTITION (country ='US') RENAME TO PARTITION (date='USA');
I know that this is not what you are looking for. Unfortunately, given that your data is already partitioned by country, the only option you have is to drop the table, remove the data (supposing your table is external) from the HDFS and reinsert the data using continent as partition.
What I would do in your case is to have multiple partition levels, so that your folder structure will look like that:
/path/to/the/data/continent='america'/country='usa'
/path/to/the/data/continent='america'/country='mexico'
/path/to/the/data/continent='europe'/country='spain'
/path/to/the/data/continent='europe'/country='italy'
...
That way you can query the data for different levels of granularity (in this case continent and country).
Adding solution here for later:
Use case: Change partition column from STRING to INT
set hive.mapred.mode=norestrict;
alter table {table_name} partition column ({column_name} {column_type});
e.g. ALTER TABLE employee PARTITION COLUMN dept INT;

How to partition large Hive table with many categories

I want to partition my table in hive so that for every unique item in the row it creates a partition. There are ~250 partitions for about a 4 billion row table so I would like to to something like a for loop or a distinct. Here is my thoughts in code (which obviously have not worked)
ALTER TABLE myTable ADD IF NOT EXISTS
PARTITION( myColumn = distinct myColumn);
or is there some kind of loop in Hive?
Does this require a UDF? A hive answer would be preferable if possible.
Thanks.
just use dynamic partitions
https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-DynamicpartitionInsert
it does the partition creation on the go

Resources