We have a table in Databricks that is mounted to a specific folder in blob storage.
We had to add new columns, so we created a new folder in blob storage containing parquet files with the new columns.
Now we want to rename this new folder to the original folder name. Will the table that points to the mount location be impacted?
table: STG_EDW.InsuranceDetails (pointing to /mnt/Insurance )
new table: STG_EDW.InsuranceDetails_newcols (pointing to /mnt/Insurance_newcols)
I tried renaming the table, but it went into an unusable state and I was unable to query the new columns.
How can we do this in Databricks with zero impact (without dropping and recreating the table)?
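For reference, a minimal sketch of one way to repoint the existing table at the new folder without dropping it (Spark SQL; assumes the table is external, and the column list below is only a placeholder):

-- add the new columns to the table definition (placeholder column list)
ALTER TABLE STG_EDW.InsuranceDetails ADD COLUMNS (<new columns>);
-- repoint the table at the folder that already holds the new parquet files
ALTER TABLE STG_EDW.InsuranceDetails SET LOCATION '/mnt/Insurance_newcols';
-- clear any cached file listing so the new files are picked up
REFRESH TABLE STG_EDW.InsuranceDetails;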
After moving my table's data to /some_loc, I am able to change its location with a command such as:
ALTER TABLE db.my_table SET LOCATION '/some_loc'; followed by
MSCK REPAIR TABLE db.my_table;
However, while this does change the table's location, the table is empty. When I do show partitions db.my_table; I see the partitions, but they are referencing the old location. I have to drop and recreate the table for the data to show up.
Is there a way to make sure the partitions point to the correct location when I change the location of the table?
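For reference, a minimal sketch of repointing partitions individually, assuming a hypothetical partition column dt; each existing partition keeps its own location until it is altered or re-added:

-- repoint a single partition at the new base location (dt and its value are hypothetical)
ALTER TABLE db.my_table PARTITION (dt='2019-01-01') SET LOCATION '/some_loc/dt=2019-01-01';

-- or drop the stale partition metadata and let MSCK rediscover it under the new location
ALTER TABLE db.my_table DROP PARTITION (dt='2019-01-01');
MSCK REPAIR TABLE db.my_table;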
Assume we have two Hive tables created under the same HDFS path.
I want to be able to drop one table WITH its HDFS files, without corrupting the other table that shares the same path.
By doing the following:
drop table test;
Then:
hadoop fs -rm -r hdfs/file/path/folder/*
I delete both tables' files, not just those of the one I've dropped.
In another post I found this solution:
-- change the table properties to make the table internal (managed)
ALTER TABLE <table-name> SET TBLPROPERTIES('EXTERNAL'='False');
-- now the table is internal; if you drop it, the data will be deleted automatically
drop table <table-name>;
But I couldn't get past the ALTER statement, as I got a permission denied error (User does not have [ALTER] privilege on table).
Any other solution?
If you have two tables using the same location, then all files in that location belong to both tables, no matter how they were created.
Say you have table1 with location hdfs/file/path/folder and table2 with the same location hdfs/file/path/folder. If you insert some data into table1, files are created, and they are read when you select from table2, and vice versa: if you insert into table2, the new files will be accessible from table1. This is because a table's data is simply whatever is stored in its location, no matter how the files got there. You can insert data into the table using SQL, put files into the location manually, etc.
Each table or partition has its own location; you cannot specify files separately.
For better understanding, read also this answer with examples about multiple tables on top of the same location: https://stackoverflow.com/a/54038932/2700344
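A minimal sketch of the situation being described, with hypothetical table and column names but the path from the question:

CREATE EXTERNAL TABLE table1 (data STRING) LOCATION 'hdfs/file/path/folder';
CREATE EXTERNAL TABLE table2 (data STRING) LOCATION 'hdfs/file/path/folder';

INSERT INTO TABLE table1 VALUES ('some row');
-- the same files back both tables, so the row is visible from table2 as well
SELECT * FROM table2;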
I have the following requirements:
- I want to create a Hive table in my user location using the HDFS files stored in another location.
- I want to copy, not move, the files, as they are shared by other users.
- The files are already stored in per-day folders. Each day a new folder is created, containing 'n' CSV files. I want my table to hold that data, partitioned by a date field.
- After the table is created once, I want it to be updated every day with that day's files.
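A minimal sketch of one way to arrange this (all table, column, and path names are illustrative); the daily step copies, rather than moves, the day folder and then registers it as a partition:

-- one-time: external table over my own location, partitioned by date
CREATE EXTERNAL TABLE my_db.daily_data (
  col1 STRING,
  col2 STRING
)
PARTITIONED BY (load_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/myuser/daily_data';

-- daily: copy that day's folder from the shared location, then register it as a partition
dfs -cp /shared/source/2019-01-01 /user/myuser/daily_data/load_date=2019-01-01;
ALTER TABLE my_db.daily_data ADD IF NOT EXISTS PARTITION (load_date='2019-01-01');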
I have data stored in HDFS in the format below, and I inserted this data into an Impala partitioned table using the "alter table add partition" command.
/user/impala/subscriber_data/year=2013/month=10/day=01
/user/impala/subscriber_data/year=2013/month=10/day=02
and everything is working fine.
Now I have new data with the same month and day (10 and 01). I need to process this data and append it to the existing HDFS directory (year=2013/month=10/day=01).
When I try to process and insert into that HDFS directory, I get an error saying the output directory already exists.
Is there any way to append the new data to the existing HDFS directory without deleting it first?
Also, how do I insert the new data into an existing partition using Impala? (I only have a table partitioned by year, month, and day.)
To insert into an existing partition, you have to drop the existing partition and add it back with all the files that make up that partition, including your new data.
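A minimal sketch of that approach in Impala, reusing the paths from the question (the table name subscriber_data is assumed); the new files are assumed to have been placed in the directory already:

ALTER TABLE subscriber_data DROP PARTITION (year=2013, month=10, day=01);
ALTER TABLE subscriber_data ADD PARTITION (year=2013, month=10, day=01)
LOCATION '/user/impala/subscriber_data/year=2013/month=10/day=01';
-- make Impala pick up the new file listing
REFRESH subscriber_data;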
I've created a working Hive script to back up data from DynamoDB to a file in an S3 bucket in AWS. A code snippet is shown below:
INSERT OVERWRITE DIRECTORY '${hiveconf:S3Location}'
SELECT *
FROM DynamoDBDataBackup;
When I run the Hive script it presumably deletes the old file and creates a new one, but if there are errors in the backup process I guess it rolls back to the old data, because the file is still there when an error has occurred.
Each day we want to make a backup, but I need to know whether an error has occurred, so I want to delete the previous day's backup first and then create a new one. If the backup fails, there is no file in the folder, which we can detect automatically.
The file is automatically named 000000.
In my Hive script I've tried, unsuccessfully:
delete FILE '${hiveconf:S3Location}/000000'
and
delete FILE '${hiveconf:S3Location}/000000.0'
Perhaps the filename is wrong. I haven't set any permissions on the file.
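For what it's worth, Hive scripts can also run filesystem commands through dfs; a sketch, assuming the S3 filesystem is configured on the cluster and using a literal path in place of the hiveconf variable:

dfs -rm -r -f s3://Bucket/directory/000000;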
I've just tried this, but it fails at STORED:
SET dynamodb.endpoint= ${DYNAMODBENDPOINT};
SET DynamoDBTableName = "${DYNAMODBTABLE}";
SET S3Location = ${LOCATION};
DROP TABLE IF EXISTS DynamoDBDataBackupPreferenceStore;
CREATE TABLE IF NOT EXISTS DynamoDBDataBackupPreferenceStore(UserGuid STRING,PreferenceKey STRING,DateCreated STRING,DateEmailGenerated STRING,DateLastUpdated STRING,ReceiveEmail STRING,HomePage STRING,EmailFormat STRING,SavedSearchCriteria STRING,SavedSearchLabel STRING),
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
LOCATION '${hiveconf:S3Location}',
TBLPROPERTIES ("dynamodb.table.name" = ${hiveconf:DynamoDBTableName}, "dynamodb.column.mapping" = "UserGuid:UserGuid,PreferenceKey:PreferenceKey,DateCreated:DateCreated,DateEmailGenerated:DateEmailGenerated,DateLastUpdated:DateLastUpdated,ReceiveEmail:ReceiveEmail,HomePage:HomePage,EmailFormat:EmailFormat,SavedSearchCriteria:SavedSearchCriteria,SavedSearchLabel:SavedSearchLabel");
You can manage files directly using Hive table commands.
Firstly, if you want to use external data controlled outside Hive, use the EXTERNAL keyword when creating the table:
set S3Path='s3://Bucket/directory/';
CREATE EXTERNAL TABLE IF NOT EXISTS S3table
( data STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION ${hiveconf:S3Path};
You can now insert data into this table
INSERT OVERWRITE TABLE S3table
SELECT data
FROM DynamoDBtable;
This will create text files in S3 inside the directory location
Note that depending on the data size and the number of reducers there may be multiple text files.
File names also contain a random GUID element, e.g. 03d3842f-7290-4a75-9c22-5cdb8cdd201b_000000.
DROP TABLE S3table;
Dropping the table just breaks the link to the files
Now, if you want to manage the directory itself, you can create a table that takes control of the S3 directory (note there is no EXTERNAL keyword this time):
CREATE TABLE IF NOT EXISTS S3table
( data STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION ${hiveconf:S3Path};
If you now issue a drop table command, all files in the folder are deleted immediately:
DROP TABLE S3table;
I suggest you create a non-external table, then drop it, and carry on with the rest of your script. If you encounter errors you will have a blank directory after the job finishes.
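Putting that together, a sketch of the suggested flow, reusing the S3Path variable and the DynamoDBDataBackup table from above (the cleanup table name is illustrative):

-- clear the backup directory by creating and dropping a managed table over it
CREATE TABLE IF NOT EXISTS S3cleanup (data STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION ${hiveconf:S3Path};
DROP TABLE S3cleanup;

-- then write the new backup into the now-empty directory, as in the original script
INSERT OVERWRITE DIRECTORY ${hiveconf:S3Path}
SELECT * FROM DynamoDBDataBackup;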
Hope this covers what you need