I have multiple existing tables stored in HDFS. I would like to create new tables from the existing external tables so that I can bucket, sort, and compress the data.
What is the proper way to create a new table from an existing one? I could export the existing table to CSV, then create a new table and import it, but it seems like there should be a way to load the data directly from the existing table. I haven't found anything in the documentation or via Google.
For an existing table named source and a newly created table named target with fields a, b, c, d:
Reading all entries from source and writing to target:
insert overwrite table target select distinct a,b,c,d from source;
This works for both internal and external tables.
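For the bucketing, sorting, and compression part of the question, a minimal sketch of what the target DDL might look like (column types, bucket count, and the ORC/Snappy choice are assumptions, not from the original setup):
-- hypothetical bucketed, sorted, compressed target table
CREATE TABLE target (a INT, b STRING, c STRING, d STRING)
CLUSTERED BY (a) SORTED BY (a) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');
-- on older Hive versions bucketing must be enforced explicitly:
-- SET hive.enforce.bucketing = true;
-- populate it from the existing source table
INSERT OVERWRITE TABLE target SELECT DISTINCT a, b, c, d FROM source;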
Related
Suppose I have an existing schema (s1) with the following tables:
s1.table1
s1.table2
I would like to clone this schema to a new schema (s2) i.e.:
s2.table1
s2.table2
How can I achieve this?
The purpose is for testing without touching the original data.
I'm aware of the EXPORT command, but it appears to export only one table at a time.
You can copy a table's definition to a new table, without copying the original table's data, using CREATE TABLE ... LIKE:
CREATE TABLE s2.table1 (LIKE s1.table1 INCLUDING ALL);
CREATE TABLE s2.table2 (LIKE s1.table2 INCLUDING ALL);
You can then copy the rows from the original table if needed:
INSERT INTO s2.table1 SELECT * FROM s1.table1;
INSERT INTO s2.table2 SELECT * FROM s1.table2;
I have created some tables in my Hadoop cluster, and I have some Parquet data that I want to load into them. How do I do this? I want to stress that I already have empty tables, created with DDL commands and also stored as Parquet, so I don't need to create the tables, only to import the data.
You can take advantage of a Hive feature that lets you use a Parquet file to import data, even though you don't want to create a new table. I'm assuming the Parquet file's schema is the same as that of the existing empty table; if that isn't the case, the statements below won't work as-is and you will have to select only the columns you need.
Here, the empty table you already have is called emptyTable and lives in myDatabase. The new data you want to add is located at /path/to/parquet/hdfs_path_of_parquet_file.
-- temporary external table that takes its schema from the existing Parquet file
-- (EXTERNAL so dropping it later does not delete your Parquet data)
CREATE EXTERNAL TABLE myDatabase.my_temp_table
LIKE PARQUET '/path/to/parquet/hdfs_path_of_parquet_file'
STORED AS PARQUET
LOCATION '/path/to/parquet/';
-- copy the rows into the existing empty table, then drop the temporary table
INSERT INTO myDatabase.emptyTable
SELECT * FROM myDatabase.my_temp_table;
DROP TABLE myDatabase.my_temp_table;
You said you didn't want to create tables but I think the above kinda cheats around your ask.
The other option, again assuming the Parquet schema already matches the definition of the empty table you already have:
ALTER TABLE myDatabase.emptyTable SET LOCATION '/path/to/parquet/';
This technically isn't creating a new table, but it does require altering the table you already created, so I'm not sure whether that's acceptable.
You said this is a Hive thing, so I've given you Hive answers, but really, if the emptyTable table definition understands Parquet in the exact format of /path/to/parquet/hdfs_path_of_parquet_file, you can just drop that file into the folder the table definition points to:
show create table myDatabase.emptyTable;
This would automatically add the data to the existing table, provided the table definition matches. Hive is schema-on-read, so you don't actually need to "import" anything, only enable Hive to "interpret" it.
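As a rough sketch (the warehouse path below is a guess; use the actual LOCATION shown in the SHOW CREATE TABLE output above):
# copy the Parquet file into the directory the table definition points to
hdfs dfs -cp /path/to/parquet/hdfs_path_of_parquet_file /user/hive/warehouse/mydatabase.db/emptytable/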
I have a few tables in Hive, and every day a new CSV file is added to each table's location. When new data is available, I need to refresh the tables so that I can see the new data in them.
The steps we follow to load the data:
1. First, create a table with CSV SerDe properties.
2. Create another table, stored as Parquet, to use in production.
3. Insert the data from the first table into the second table (sketched below).
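A minimal sketch of that flow, assuming hypothetical table names and a two-column layout matching the sample rows below:
-- hypothetical staging table over the CSV files
CREATE EXTERNAL TABLE staging_t (id STRING, val STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION '/data/staging_t';
-- hypothetical production table stored as Parquet
CREATE TABLE prod_t (id INT, val STRING)
STORED AS PARQUET;
-- load the staged rows into the production table
INSERT INTO TABLE prod_t SELECT CAST(id AS INT), val FROM staging_t;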
Initial:
1,a
2,b
3,c
New file:
4,d
I searched on Google and found this can be done via:
1) An incremental table: load the new file into an incremental table and run an insert statement. In my case we have more than 100 tables, so I don't want to create that many incremental tables.
2) The REFRESH command via the Impala shell. Our initial tables are stored in a CSV SerDe format, so when I run REFRESH on them I get an error that Impala doesn't support the SerDe properties.
Can you please suggest a solution for my case?
I have some data in hdfs.
This data was migrated from a PostgreSQL database by using Sqoop.
The data is in the usual Hadoop output layout: files like _SUCCESS, part-m-00000, and so on. I need to create a Hive table based on this data and then export that table to a single tab-separated file.
As far as I know, I can create a table this way.
create external table table_name (
id int,
myfields string
)
location '/my/location/in/hdfs';
Then I can save the table as a TSV file:
hive -e 'select * from some_table' > /home/myfile.tsv
I don't know how to load data from hdfs into a Hive table.
Moreover, should I manually define the table structure in the CREATE statement, or is there an automated way to have all the columns created automatically?
I don't know how to load data from hdfs into Hive table
You create a table schema over an HDFS directory, exactly as you're doing.
should I manually define the table structure in the CREATE statement, or is there an automated way to have all the columns created automatically?
Unless you told Sqoop to create the table for you (for example with --hive-import), you must define it manually.
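For example, a rough sketch of an import that lets Sqoop create the Hive table for you (connection string, credentials, and table names are placeholders):
sqoop import --connect jdbc:postgresql://dbhost/mydb --username dbuser \
  --table source_table \
  --hive-import --create-hive-table --hive-table my_hive_table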
export this table into a single tab-separated file.
A query might work. Or, unless Sqoop already set the delimiter to \t, you need to create another table from the first one, specifying that column separator. Then you don't even need to query the table; just run hdfs dfs -getmerge on its directory.
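A minimal sketch of that route, assuming a hypothetical table name and the default warehouse path:
-- write the data back out tab-separated
CREATE TABLE some_table_tsv
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
AS SELECT * FROM some_table;
Then merge the resulting part files into one local file:
hdfs dfs -getmerge /user/hive/warehouse/some_table_tsv /home/myfile.tsv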
In Hive, creating an external table by CTAS is a semantic error. Why?
A table created by CTAS is atomic, while "external" just means the data will not be deleted when the table is dropped; the two do not seem to conflict.
In Hive, when we create a table (not external), the data is stored under /user/hive/warehouse. But when creating an external Hive table the files can be anywhere else; we are just pointing at that HDFS directory and exposing the data as a Hive table so that we can run Hive queries on it, etc.
This SO answer covers it more precisely: Create hive table using "as select" or "like" and also specify delimiter.
Am I missing something here?
Try this... you should be able to create an external table with CTAS:
CREATE TABLE ext_table LOCATION '/user/XXXXX/XXXXXX'
AS SELECT * from managed_table;
I was able to create one. I am using 0.12.
I think it's a semantic error because CTAS misses the most important parameter of an external table definition, viz. the external location of the data file. By definition:
1. External means the data resides outside Hive's control, outside the Hive warehouse directory.
2. If the table is dropped, the data remains intact; only the table definition is removed from the Hive metastore.
So:
i. If CTAS were done from a managed table, the new external table would have its file in the warehouse, which would be removed with DROP TABLE, making #2 wrong.
ii. If CTAS were done from another external table, the two tables would point to the same file location.
CTAS creates a managed Hive table with the new name, using the schema and data of the source table.
You can convert it to an external table using:
ALTER TABLE <TABLE_NAME> SET TBLPROPERTIES('EXTERNAL'='TRUE');
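Putting the two steps together, a minimal sketch with placeholder table names:
-- create the new table from the source via CTAS (it starts out managed)
CREATE TABLE new_table AS SELECT * FROM source_table;
-- flip it to external so dropping it later will not delete the data
ALTER TABLE new_table SET TBLPROPERTIES('EXTERNAL'='TRUE');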