Hive, create table ___ like ___ stored as ___ - hadoop

I have a table in hive stored as text files. I want to move all the data into another table with the same schema but stored as sequence files.
How do I create the second table? I wanted to use the Hive CREATE TABLE ... LIKE command, but it doesn't support STORED AS SEQUENCEFILE:
hive> create table test_sq like test_t stored as sequencefile;
FAILED: ParseException line 1:33 missing EOF at 'stored' near 'test_t'
I am looking for a programmatic way so that I can replicate the same process for more tables.

CREATE TABLE test_sq LIKE test_t;
This just copies the source table definition; the new table contains no rows. Since you said you have to move all the data, the query above is not suitable.
Try this instead:
CREATE TABLE test_sq row format delimited fields terminated by '|' STORED AS sequencefile AS select * from test_t;
The target cannot be a partitioned table.
The target cannot be an external table.
This copies the structure as well as the data.
Note: if you don't want the ROW FORMAT DELIMITED clause, remove it from the query. You can also add a WHERE clause to copy only selected rows, as in the sketch below.
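For example, a minimal sketch of the same CTAS restricted by a WHERE clause (the id column and the cutoff value are only illustrative, not taken from the original table):
CREATE TABLE test_sq STORED AS sequencefile AS
SELECT * FROM test_t
WHERE id > 100;  -- illustrative filter; use whatever predicate selects the rows you need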

Try using create + insert together.
Use the normal DDL statement to create the table.
CREATE TABLE test2 (a INT) STORED AS SEQUENCEFILE
then use
INSERT INTO TABLE test2 SELECT * FROM test;
Here, test is the table stored as TEXTFILE and test2 is the table stored as SEQUENCEFILE.
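If you want to confirm the copy afterwards, a quick sanity check (nothing Hive-specific, just a sketch) is to compare row counts:
SELECT COUNT(*) FROM test;
SELECT COUNT(*) FROM test2;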

Related

How do I convert a sequence file to parquet format

I have a HIVE Table (test) that I need to create in the PARQUET format. I will be using a bunch of SEQUENCE files in order to create and insert into a table.
Once the table is created, is there a way to convert into PARQUET? I mean I know we could have done, say
CREATE TABLE default.test( user_id STRING, location STRING)
PARTITIONED BY ( dt INT ) STORED AS PARQUET
initially while creating the table itself. However, in my case I am forced to use SEQUENCE files to create the table first because it is the format that I have to begin with and cannot directly convert to PARQUET.
Is there a way I could convert into parquet after the table is created and data inserted?
To convert from sequence file to Parquet you need to load the data into a new table (CTAS).
The question is tagged with presto, so I am giving you the Presto syntax for this. I am including partitioning, because the example in the question contains it.
CREATE TABLE test_parquet WITH(format='PARQUET', partitioned_by=ARRAY['dt']) AS
SELECT * FROM test_sequencefile;
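If you are running this in Hive itself rather than Presto, classic Hive CTAS cannot target a partitioned table, so a rough equivalent is create-then-insert with dynamic partitioning. A sketch, assuming the same column and table names as above:
CREATE TABLE test_parquet (user_id STRING, location STRING)
PARTITIONED BY (dt INT)
STORED AS PARQUET;

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- partition column must come last in the SELECT for dynamic partition inserts
INSERT OVERWRITE TABLE test_parquet PARTITION (dt)
SELECT user_id, location, dt FROM test_sequencefile;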

How to delete fields from a partitioned table in Hive stored as parquet?

I'm looking for a way to modify a parquet data table in HIVE to remove some fields. The table is managed but it doesn't matter because I can convert it to external.
The problem is that I cannot use ALTER TABLE ... REPLACE COLUMNS with partitioned parquet tables.
It works well for the textfile format (partitioned or not), but only for non-partitioned parquet tables.
I've tried to replace column but this is the result:
hive> ALTER TABLE db_test.mytable REPLACE COLUMNS(name String);
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
Replacing columns cannot drop columns for table db_test.mytable.
SerDe may be incompatible
I've thought about some solutions, but none of them fits my scenario:
First
- [Optional] Convert the table in external.
- Delete the table.
- Re-create the table with the fields that I want.
- MSCK REPAIR TABLE to add HDFS partitions.
- [Optional] Convert back to managed table.
Second
- Create temporary table as selection of the original table with the fields that I choose.
- Delete the original table.
- Rename the temporary table to the original name.
Both options affect my process because I would lose the statistics of my table. This table is consumed by MicroStrategy via Impala, and I need to maintain the statistics.
In addition, the second solution is expensive for very large tables.
Any suggestions?
Thanks in advance.
You can use the first method and then run
hive> ANALYZE TABLE <db_name>.<table_name> COMPUTE STATISTICS;
to compute all the statistics of the table.
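If the table is partitioned, statistics can also be gathered per partition and per column; a sketch, assuming dt is the partition column (adjust to your real one):
hive> ANALYZE TABLE db_test.mytable PARTITION (dt) COMPUTE STATISTICS;
hive> ANALYZE TABLE db_test.mytable PARTITION (dt) COMPUTE STATISTICS FOR COLUMNS;
Since the table is read through Impala, Impala keeps its own statistics; refreshing them from impala-shell would be:
COMPUTE STATS db_test.mytable;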

Create partitioned table from non partitioned table

Suppose I have internal orc non partitioned table in Hive:
CREATE TABLE IF NOT EXISTS non_partitioned_table(
id STRING,
company STRING,
city STRING,
country STRING
)
STORED AS ORC;
Is it possible to somehow create a partitioned parquet table via a CREATE TABLE ... LIKE statement, like this?
create partitioned_table PARTITION ON (date STRING) like non_partitioned_table;
alter table partitioned_table SET FILEFORMAT PARQUET;
This create statement doesn't work.
So basically I need to add a column and make the table partitioned by that column. I know I can create the table through a plain CREATE TABLE statement, but I need to do it with CREATE TABLE LIKE and then alter it somehow.
Your table doesn't have a date column to begin with, so you're going to have to make a new one.
You might be able to ALTER TABLE non_partitioned_table ADD PARTITION, but I haven't tried that myself. If you want to try it, I would suggest the partition location be outside of the existing HDFS directory.
In any case, the CREATE TABLE ... LIKE DDL does not support PARTITIONED BY:
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
LIKE existing_table_or_view_name
[LOCATION hdfs_path];
You need to copy the schema from DESCRIBE TABLE on the first table, then write a new CREATE TABLE that adds the PARTITIONED BY clause and, optionally, STORED AS. (ALTER TABLE ... SET FILEFORMAT PARQUET does not convert existing data files in place.)
Then, if you want the data in the new table, you need to INSERT OVERWRITE TABLE, as in the sketch below.
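Putting that together, a rough sketch under the assumptions above (date is the new partition column from the question; it is a keyword, hence the backticks, and the constant value is only a placeholder for whatever expression actually supplies the date):
CREATE TABLE partitioned_table (
  id STRING,
  company STRING,
  city STRING,
  country STRING
)
PARTITIONED BY (`date` STRING)
STORED AS PARQUET;

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE partitioned_table PARTITION (`date`)
SELECT id, company, city, country,
       '1970-01-01' AS `date`  -- placeholder; derive the real partition value from your data
FROM non_partitioned_table;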

Spark write data into partitioned Hive table very slow

I want to store a Spark DataFrame into a Hive table in normal readable text format. To do so, I first did:
sqlContext.sql("SET spark.sql.hive.convertMetastoreParquet=false")
My DataFrame is like:
final_data1_df = sqlContext.sql("select a, b from final_data")
and I am trying to write it by:
final_data1_df.write.partitionBy("b").mode("overwrite").saveAsTable("eefe_lstr3.final_data1")
but this is very slow, even slower than HIVE table write. So to resolve this I thought to define partition through Hive DDL statement and then load data like:
sqlContext.sql("""
CREATE TABLE IF NOT EXISTS eefe_lstr3.final_data1(
a BIGINT
)
PARTITIONED BY (b INT)
"""
)
sqlContext.sql("""
INSERT OVERWRITE TABLE eefe_lstr3.final_data1 PARTITION (b)
select * from final_data1""")
but this gives a partitioned Hive table with the data still in parquet format. Am I missing something here?
When you create the table explicitly then that DDL defines the table.
Normally text file is the default in Hive but it could have been changed in your environment.
Add "STORED AS TEXTFILE" at the end of the CREATE statement to make sure the table is plain text.

How to make CREATE TABLE ... AS SELECT in Hive not populate data?

When I run CTAS in Hive, the data is populated at the same time. But I just want to create the table, not populate the data. How should I do this? Thanks.
You can do that by using the LIKE keyword.
create table new_table_name LIKE old_table_name
This will create the table structure without the data.
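The table can then be populated later, once you actually want the rows; for example:
INSERT INTO TABLE new_table_name SELECT * FROM old_table_name;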
Use CREATE EXTERNAL TABLE instead of CREATE TABLE; note the EXTERNAL keyword.
Or use a WHERE condition in the SELECT statement with a value that fetches no records from Hive.
Example table name demo1
id name country
1 abc India
2 xyz Germany
3 pqr France
In CREATE TABLE ... AS SELECT in Hive:
CREATE TABLE demo2 AS SELECT id, name, country FROM demo1 WHERE id=0;
In the WHERE condition above, id is given as 0, and with the data above the SELECT statement will fetch no records; similarly, choose a value in the WHERE condition that returns no records. Hence no data will be inserted into the newly created table.
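A predicate that can never be true works the same way without depending on the actual data; for example (demo3 is just an illustrative name for the new empty table):
CREATE TABLE demo3 AS SELECT id, name, country FROM demo1 WHERE 1 = 0;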
@Sunil's answer helped me as well; I am just posting an addition that was necessary in my case.
The source table was in Avro format but the new one I wanted in ORC, hence,
CREATE TABLE dataaggregate_orc_empty LIKE dataaggregate_avro_compressed
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES ('orc.compress'='ZLIB');
The above step can be split into two steps, if required:
CREATE TABLE dataaggregate_orc_empty LIKE dataaggregate_avro_compressed;
ALTER TABLE dataaggregate_orc_empty SET FILEFORMAT ORC;
I would be glad if someone could provide input on the data format changes that occur in this process and any related problems.
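Either way, the new ORC table is empty at that point; the data still has to be copied in, for example:
INSERT INTO TABLE dataaggregate_orc_empty
SELECT * FROM dataaggregate_avro_compressed;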
