Create a Hive table in my user location using HDFS files from another location - hadoop

I have the following requirements:
- I want to create a Hive table in my user location using HDFS files stored in another location.
- I want to copy the files, not move them, because they are shared with other users.
- The files are already organized into date folders. Each day a new folder is created containing 'n' CSV files. I want my table to hold the data from those files, partitioned by a date field.
- Once the table has been created, I want it to be updated every day with that day's files.
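One way to approach these requirements is an external table located under your own user directory, plus a daily job that copies the new day folder and registers it as a partition. The sketch below only builds the commands as strings so they can be reviewed before being run. Everything named here is an assumption for illustration: the table name, the two column names, both HDFS paths, the partition column dt, and the idea that day folders are named YYYY-MM-DD.

```python
from datetime import date
from typing import List

# Hypothetical placeholders, not taken from the question.
TABLE = "my_table"
SHARED_ROOT = "/shared/data"       # assumed location of the shared day folders
MY_ROOT = "/user/me/my_table"      # assumed table location in my user directory


def create_table_ddl() -> str:
    """One-time DDL: external CSV table partitioned by a date string."""
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {TABLE} (col1 STRING, col2 STRING)\n"
        "PARTITIONED BY (dt STRING)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{MY_ROOT}';"
    )


def daily_commands(day: date) -> List[str]:
    """Shell commands for one day: copy (not move) the files, then register them."""
    d = day.isoformat()  # assumes day folders are named like 2024-01-05
    target = f"{MY_ROOT}/dt={d}"
    return [
        f"hdfs dfs -mkdir -p {target}",
        # The glob is quoted so HDFS, not the local shell, expands it.
        f"hdfs dfs -cp '{SHARED_ROOT}/{d}/*.csv' {target}/",
        f"hive -e \"ALTER TABLE {TABLE} ADD IF NOT EXISTS PARTITION (dt='{d}') "
        f"LOCATION '{target}'\"",
    ]


if __name__ == "__main__":
    print(create_table_ddl())
    for cmd in daily_commands(date.today()):
        print(cmd)
```

Because the copy uses hdfs dfs -cp, the shared files stay where they are. A scheduler (cron, Oozie, or similar) could run the printed daily commands once per day.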

Related

Find a column name from the list of parquet files in a folder in Synapse using SQL

I am using Azure Synapse and have a folder containing multiple subfolders, each with some Parquet files. When you right-click a Parquet file, you get the option to select the top 100 rows from that file.
I want to write a query there: given a column name, how do I find which folder contains that column, using SQL?

Does an external Hive table refresh itself when a file is added to the directory it points to?

I have a directory in HDFS. Every day one processed file is placed in that directory, with a date-time stamp in the file name. If I create an external table on top of that directory location, does the external table refresh itself when each day's file arrives in that directory?
If you add files to a table directory or partition directory, the data becomes accessible to queries, whether the table is external or managed. You do not need any additional steps to make the data available; no refresh is necessary.
A Hive table/partition is metadata (DDL, location, statistics, access permissions, etc.) plus the data files in its location. So the data is stored in the table/partition location in HDFS.
Only if you create a new directory for a partition that does not exist yet do you need to execute ALTER TABLE table_name ADD PARTITION (partition_col='value') LOCATION '<new location>' or the MSCK REPAIR TABLE table_name command. The equivalent command on Amazon Elastic MapReduce (EMR)'s version of Hive is: ALTER TABLE table_name RECOVER PARTITIONS.
If you add files to table/partition locations that already exist, no refresh is necessary.
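As a rough illustration of the two ways to register a new partition directory, the helpers below build the statements named above as strings. The table name, partition column, and location used in the usage lines are hypothetical, and every partition value is quoted as a string, which assumes string-typed partition columns.

```python
from typing import Dict


def add_partition_sql(table: str, spec: Dict[str, str], location: str) -> str:
    """Register one new partition directory explicitly."""
    # Values are quoted as strings; adjust for numeric partition columns.
    cols = ", ".join(f"{col}='{val}'" for col, val in spec.items())
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION ({cols}) "
        f"LOCATION '{location}';"
    )


def repair_sql(table: str, on_emr: bool = False) -> str:
    """Let Hive discover partition directories under the table location."""
    if on_emr:
        return f"ALTER TABLE {table} RECOVER PARTITIONS;"
    return f"MSCK REPAIR TABLE {table};"


if __name__ == "__main__":
    print(add_partition_sql("logs", {"dt": "2024-01-05"}, "/data/logs/dt=2024-01-05"))
    print(repair_sql("logs"))
    print(repair_sql("logs", on_emr=True))
```

The explicit ADD PARTITION form works for any directory layout, while the repair commands rely on directories following the col=value naming convention under the table location.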
The cost-based optimizer (CBO) can use statistics to answer a query without reading the data files. This works only for simple queries, such as count(*) or max().
If you are using CBO with statistics for query calculation, you may need to refresh it using ANALYZE TABLE hive_table PARTITION(partitioned_col) COMPUTE STATISTICS. See this answer for more details: https://stackoverflow.com/a/39914232/2700344
If you do not need statistics and want your table location to be scanned every time you query it, switch it off: set hive.compute.query.using.stats=false;
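The two options described above (refresh the statistics, or stop answering from them) can be sketched as a small script builder. The table and partition column names in the usage lines are hypothetical; the statements themselves are the ones quoted in the answer.

```python
from typing import List


def count_script(table: str, partition_col: str, trust_stats: bool) -> List[str]:
    """Build a HiveQL script that counts rows after newly added files.

    trust_stats=True  -> recompute statistics first, then count.
    trust_stats=False -> disable stats-based answers so the files are scanned.
    """
    if trust_stats:
        prelude = f"ANALYZE TABLE {table} PARTITION({partition_col}) COMPUTE STATISTICS;"
    else:
        prelude = "set hive.compute.query.using.stats=false;"
    return [prelude, f"SELECT count(*) FROM {table};"]


if __name__ == "__main__":
    for stmt in count_script("hive_table", "partitioned_col", trust_stats=False):
        print(stmt)
```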

Creating HBase table for files in HDFS directory

I am trying to load the data from all files in an HDFS directory into an existing HBase table. Can you please tell me how to load all the files' data, plus incremental data, into the HBase table?
I created the HBase table as
hbase>create 'sample','cf'
I have to copy
hdfs://ip:port/user/test
into the sample HBase table. Please suggest a solution.
Answer 1:(possible)
With ImportTsv, if you provide only the /user/hadoop/ directory path instead of a full file path, it should process all files within that directory.
Answer 2:(seems not possible)
The special column name HBASE_ROW_KEY is used to designate that this
column should be used as the row key for each imported record. You
must specify exactly one column to be the row key, and you must
specify a column name for every column that exists in the input data.
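To make the ImportTsv invocation concrete, here is a minimal sketch that assembles the command line and enforces the "exactly one HBASE_ROW_KEY" rule quoted above. The table 'sample' and family 'cf' come from the question; the column qualifiers cf:a and cf:b and the comma separator are assumptions, since the file layout is not given.

```python
import shlex
from typing import List, Optional


def import_tsv_argv(table: str, columns: List[str], input_dir: str,
                    separator: Optional[str] = None) -> List[str]:
    """Build the argument list for HBase's ImportTsv MapReduce job."""
    if columns.count("HBASE_ROW_KEY") != 1:
        raise ValueError("exactly one column must be HBASE_ROW_KEY")
    argv = [
        "hbase", "org.apache.hadoop.hbase.mapreduce.ImportTsv",
        "-Dimporttsv.columns=" + ",".join(columns),
    ]
    if separator is not None:
        # The default separator is a tab; CSV input needs an explicit one.
        argv.append("-Dimporttsv.separator=" + separator)
    # Passing a directory makes the job read every file inside it.
    argv += [table, input_dir]
    return argv


if __name__ == "__main__":
    argv = import_tsv_argv(
        "sample",
        ["HBASE_ROW_KEY", "cf:a", "cf:b"],  # cf:a and cf:b are hypothetical
        "hdfs://ip:port/user/test",
        separator=",",
    )
    print(" ".join(shlex.quote(a) for a in argv))
```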

Impala partitioned table with HDFS

I have data stored in HDFS in the format below, and I added this data to an Impala partitioned table using the "alter table add partition" command.
/user/impala/subscriber_data/year=2013/month=10/day=01
/user/impala/subscriber_data/year=2013/month=10/day=02
and everything is working fine.
Now I have new data whose month and day are 10 and 01. I need to process this data and append it to the existing HDFS directory (year=2013/month=10/day=01).
When I try to process it and write to the HDFS directory, I get an error saying the output directory already exists.
Is there any way to append the new data to the existing HDFS directory without deleting that directory?
Also, how do I insert the new data into an existing partition using Impala? (I only have a table partitioned on year, month, day.)
To insert into an existing partition, you have to drop the existing partition and add it back with all the files that make up that partition, including your new data.

How to delete a 000000 file in an S3 bucket in AWS using a Hive script

I've created a working Hive script to back up data from DynamoDB to a file in an S3 bucket in AWS. A code snippet is shown below:
INSERT OVERWRITE DIRECTORY '${hiveconf:S3Location}'
SELECT *
FROM DynamoDBDataBackup;
When I run the Hive script, it presumably deletes the old file and creates a new one. But if there are errors in the backup process, I guess it rolls back to the old data, because the file is still there after an error has occurred.
We want to make a backup each day, but I need to know whether an error has occurred. So I want to delete the previous day's backup first and then create the new backup. If it fails, there is no file in the folder, which we can detect automatically.
The file is automatically named 000000.
In my Hive script I've tried, unsuccessfully:
delete FILE '${hiveconf:S3Location}/000000'
and
delete FILE '${hiveconf:S3Location}/000000.0'
Perhaps the filename is wrong. I haven't set any permissions on the file.
I've just tried this, but it fails at STORED:
SET dynamodb.endpoint= ${DYNAMODBENDPOINT};
SET DynamoDBTableName = "${DYNAMODBTABLE}";
SET S3Location = ${LOCATION};
DROP TABLE IF EXISTS DynamoDBDataBackupPreferenceStore;
CREATE TABLE IF NOT EXISTS DynamoDBDataBackupPreferenceStore(UserGuid STRING,PreferenceKey STRING,DateCreated STRING,DateEmailGenerated STRING,DateLastUpdated STRING,ReceiveEmail STRING,HomePage STRING,EmailFormat STRING,SavedSearchCriteria STRING,SavedSearchLabel STRING),
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
LOCATION '${hiveconf:S3Location}',
TBLPROPERTIES ("dynamodb.table.name" = ${hiveconf:DynamoDBTableName}, "dynamodb.column.mapping" = "UserGuid:UserGuid,PreferenceKey:PreferenceKey,DateCreated:DateCreated,DateEmailGenerated:DateEmailGenerated,DateLastUpdated:DateLastUpdated,ReceiveEmail:ReceiveEmail,HomePage:HomePage,EmailFormat:EmailFormat,SavedSearchCriteria:SavedSearchCriteria,SavedSearchLabel:SavedSearchLabel");
You can manage the files directly using Hive table commands.
First, if you want to use data that is controlled outside Hive, use the EXTERNAL keyword when creating the table:
set S3Path='s3://Bucket/directory/';
CREATE EXTERNAL TABLE IF NOT EXISTS S3table
( data STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION ${hiveconf:S3Path};
You can now insert data into this table:
INSERT OVERWRITE TABLE S3table
SELECT data
FROM DynamoDBtable;
This will create text files in S3 inside the directory location.
Note that, depending on the data size and the number of reducers, there may be multiple text files.
File names also include a random GUID element, e.g. 03d3842f-7290-4a75-9c22-5cdb8cdd201b_000000
DROP TABLE S3table;
Dropping the table just breaks the link to the files; the files themselves remain.
Now, if you want to manage the directory, you can create a table that takes control of the S3 directory (note there is no EXTERNAL keyword):
CREATE TABLE IF NOT EXISTS S3table
( data STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION ${hiveconf:S3Path};
If you now issue a DROP TABLE command, all files in the folder are deleted immediately:
DROP TABLE S3table;
I suggest you create a non-external table, then drop it, and carry on with the rest of your script. If you encounter errors, you will have an empty directory after the job finishes.
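The suggested flow can be sketched in two parts: the statements that clear the directory before the backup, and a check run afterwards to detect a failed backup from an object listing. The table name S3cleanup is a hypothetical placeholder, and the listing is passed in as plain strings, so how you obtain it (AWS CLI, SDK, and so on) is left open. Skipping keys that end in "/" or "_$folder$" assumes those are directory markers rather than data files.

```python
from typing import Iterable, List


def clear_directory_statements(s3_path: str, table: str = "S3cleanup") -> List[str]:
    """HiveQL that empties s3_path by creating and dropping a managed table."""
    return [
        # No EXTERNAL keyword: Hive owns the directory, so DROP deletes its files.
        f"CREATE TABLE IF NOT EXISTS {table} (data STRING) "
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '{s3_path}';",
        f"DROP TABLE {table};",
    ]


def backup_present(keys: Iterable[str], prefix: str) -> bool:
    """Return True if at least one data file exists under the prefix."""
    for key in keys:
        if not key.startswith(prefix):
            continue
        name = key[len(prefix):]
        # Assumed directory markers are ignored; anything else counts as data.
        if name and not name.endswith("/") and not name.endswith("_$folder$"):
            return True
    return False


if __name__ == "__main__":
    for stmt in clear_directory_statements("s3://Bucket/directory/"):
        print(stmt)
    listing = ["directory/03d3842f-7290-4a75-9c22-5cdb8cdd201b_000000"]
    print(backup_present(listing, "directory/"))
```

If backup_present returns False after the job has run, the directory was cleared but the INSERT did not write any output, which is the failure signal the question asks for.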
Hope this covers what you need
