Hive external table location in google cloud storage is ignoring subdirectories - hadoop

I have a bunch of large csv.gz files in Google Cloud Storage that we got from an external source. We need to bring this data into BigQuery so we can start querying, but BigQuery cannot directly ingest gzipped CSV files larger than 4 GB. So I decided to convert these files into Parquet format and then load that into BigQuery.
Let's take the example of the websites.csv.gz file, which is under the path gs://<BUCKET-NAME>/websites/websites.csv.gz.
Now, for this I wrote a Hive script as below:
CREATE EXTERNAL TABLE websites (
col1 string,
col2 string,
col3 string,
col4 string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 'gs://<BUCKET-NAME>/websites/'
TBLPROPERTIES ('skip.header.line.count'='1');
MSCK REPAIR TABLE websites;
CREATE EXTERNAL TABLE par_websites (
col1 string,
col2 string,
col3 string,
col4 string
) STORED AS PARQUET LOCATION 'gs://<BUCKET-NAME>/websites/par_websites/';
INSERT OVERWRITE TABLE par_websites
SELECT *
FROM websites;
This works well and creates a new folder par_websites in the specified location gs://<BUCKET-NAME>/websites/par_websites/, with one parquet file inside it.
But when the websites.csv.gz file is in a subfolder, e.g. gs://<BUCKET-NAME>/data/websites/, and I update the script so that the read and write locations are gs://<BUCKET-NAME>/data/websites/ and gs://<BUCKET-NAME>/data/websites/par_websites, it does not work at all. Hive does not seem to read from gs://<BUCKET-NAME>/data/websites/websites.csv.gz, and instead of creating a par_websites folder inside gs://<BUCKET-NAME>/data/websites, it creates a new folder gs://<BUCKET-NAME>/websites/par_websites with no parquet file inside.
Why is that and how can I make Hive read and write from subfolders?

Hive was caching my previous table definitions, so when I updated the locations it was still using the older versions and not picking up the changes.
Once I changed the table names and processed again, all worked well.
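For reference, here is a minimal sketch of the reworked script under that fix, using fresh table names (websites_v2 and par_websites_v2 are just hypothetical names) and pointing both locations at the subfolder:
DROP TABLE IF EXISTS websites_v2;
CREATE EXTERNAL TABLE websites_v2 (
col1 string,
col2 string,
col3 string,
col4 string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'gs://<BUCKET-NAME>/data/websites/'  -- read the csv.gz from the subfolder
TBLPROPERTIES ('skip.header.line.count'='1');
DROP TABLE IF EXISTS par_websites_v2;
CREATE EXTERNAL TABLE par_websites_v2 (
col1 string,
col2 string,
col3 string,
col4 string
) STORED AS PARQUET
LOCATION 'gs://<BUCKET-NAME>/data/websites/par_websites/';  -- write parquet inside the subfolder
INSERT OVERWRITE TABLE par_websites_v2
SELECT *
FROM websites_v2;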

Related

How to store multiple files under the same directory in hive?

I'm using Hive to process my CSV files. I've stored the CSV files in HDFS and want to create tables from those files.
I use the following command:
create external table if not exists csv_table (dummy STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 'hdfs://localhost:9000/user/hive'
TBLPROPERTIES ("skip.header.line.count"="1");
LOAD DATA INPATH '/CsvData/csv_table.csv' OVERWRITE INTO TABLE csv_table;
So the file under /CsvData will be moved into /user/hive, which makes sense.
But what if I want to create another table?
create external table if not exists csv_table2 (dummy STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 'hdfs://localhost:9000/user/hive'
TBLPROPERTIES ("skip.header.line.count"="1");
LOAD DATA INPATH '/CsvData/csv_table2.csv' OVERWRITE INTO TABLE csv_table2;
It will raise an exception complaining that the directory is not empty.
ERROR : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask. Directory hdfs://localhost:9000/user/hive could not be cleaned up.
So it is hard for me to understand: does it mean I can store only one file under one directory? To store multiple files, do I have to create one directory for every file?
Is it possible to store all the files together?
The CREATE TABLE statement will NOT raise an exception complaining that the directory is not empty, because creating a table on top of an existing directory is a perfectly normal scenario.
You can store as many files in the directory as necessary, and all of them will be accessible to the table built on top of that folder.
A table's location is a directory, not a file. If you need to create a new table and keep its files separate from other tables' files, create a separate folder.
Read also this answer for clear understanding: https://stackoverflow.com/a/54038932/2700344
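To illustrate, here is a hedged sketch of one-directory-per-table (the directory names are just examples): each table gets its own folder, and each folder can hold as many files as you like.
-- give every table its own directory under /user/hive
create external table if not exists csv_table (dummy STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 'hdfs://localhost:9000/user/hive/csv_table'
TBLPROPERTIES ("skip.header.line.count"="1");
create external table if not exists csv_table2 (dummy STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 'hdfs://localhost:9000/user/hive/csv_table2'
TBLPROPERTIES ("skip.header.line.count"="1");
-- both loads now succeed because the two tables no longer share a directory
LOAD DATA INPATH '/CsvData/csv_table.csv' OVERWRITE INTO TABLE csv_table;
LOAD DATA INPATH '/CsvData/csv_table2.csv' OVERWRITE INTO TABLE csv_table2;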

How to point one Hive Table to Multiple External Files?

I would like to be able to append multiple HDFS files to one Hive table while leaving the HDFS files in their original directories. These files are located in different directories.
The LOAD DATA INPATH command moves the HDFS file into the Hive warehouse directory.
As far as I can tell, an external table must be pointed to one file, or to one directory within which multiple files with the same schema can be placed. However, my files are not underneath a single directory.
Is it possible to point a single Hive table to multiple external files in separate directories, or to otherwise copy multiple files into a single Hive table without moving the files from their original HDFS location?
Expanded solution based on Pradeep's answer:
For example, my files look like this:
/root_directory/<job_id>/input/<dt>
Pretend the schema of each is (foo STRING, bar STRING, job_id STRING, dt STRING)
I first create an external table. However, note that my DDL does not contain an initial location, and it does not include the job_id and dt fields:
CREATE EXTERNAL TABLE hivetest (
foo STRING,
bar STRING
) PARTITIONED BY (job_id STRING, dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
;
Let's say I have two files I wish to insert located at:
/root_directory/b1/input/2014-01-01
/root_directory/b2/input/2014-01-02
I can load these two external files into the same Hive table like so:
ALTER TABLE hivetest
ADD PARTITION(job_id = 'b1', dt='2014-01-01')
LOCATION '/root_directory/b1/input/2014-01-01';
ALTER TABLE hivetest
ADD PARTITION(job_id = 'b2', dt='2014-01-02')
LOCATION '/root_directory/b2/input/2014-01-02';
If anyone happens to require the use of Talend to perform this, they can use the tHiveLoad component [edit: this doesn't work; check below].
The code Talend produces for this using tHiveLoad is actually LOAD DATA INPATH ..., which will remove the file from its original location in HDFS.
You will have to do the earlier ALTER TABLE syntax in a tHiveLoad instead.
The short answer is yes. A Hive External Table can be pointed to multiple files/directories. The long answer will depend on the directory structure of your data. The typical way you do this is to create a partitioned table with the partition columns mapping to some part of your directory path.
E.g. We have a use case where an external table points to thousands of directories on HDFS. Our paths conform to this pattern /prod/${customer-id}/${date}/. In each of these directories we have approx 100 files. In mapping this into a Hive Table, we created two partition columns, customer_id and date. So every day, we're able to load the data into Hive, by doing
ALTER TABLE x ADD PARTITION (customer_id = "blah", dt = "blah_date") LOCATION '/prod/blah/blah_date';
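For completeness, a rough sketch of the table definition behind that statement; the data columns and the delimiter are placeholders (only the two partition columns come from the description above):
CREATE EXTERNAL TABLE x (
col1 STRING,  -- placeholder data columns; the real table has its own schema
col2 STRING
) PARTITIONED BY (customer_id STRING, dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
The daily ALTER TABLE ... ADD PARTITION statement above then simply registers each /prod/${customer-id}/${date}/ directory against this table.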
Try this:
LOAD DATA LOCAL INPATH '/path/local/file_1' INTO TABLE tablename;
LOAD DATA LOCAL INPATH '/path/local/file_2' INTO TABLE tablename;

How to add partition using hive by a specific date?

I'm using hive (with external tables) to process data stored on amazon S3.
My data is partitioned as follows:
DIR s3://test.com/2014-03-01/
DIR s3://test.com/2014-03-02/
DIR s3://test.com/2014-03-03/
DIR s3://test.com/2014-03-04/
DIR s3://test.com/2014-03-05/
s3://test.com/2014-03-05/ip-foo-request-2014-03-05_04-20_00-49.log
s3://test.com/2014-03-05/ip-foo-request-2014-03-05_06-26_19-56.log
s3://test.com/2014-03-05/ip-foo-request-2014-03-05_15-20_12-53.log
s3://test.com/2014-03-05/ip-foo-request-2014-03-05_22-54_27-19.log
How can I create a partitioned table using Hive?
CREATE EXTERNAL TABLE test (
foo string,
time string,
bar string
) PARTITIONED BY (? string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://test.com/';
Could somebody answer this question? Thanks!
First start with the right table definition. In your case I'll just use what you wrote:
CREATE EXTERNAL TABLE test (
foo string,
time string,
bar string
) PARTITIONED BY (dt string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://test.com/';
Hive by default expects partitions to be in subdirectories named via the convention s3://test.com/partitionkey=partitionvalue. For example
s3://test.com/dt=2014-03-05
If you follow this convention you can use MSCK to add all partitions.
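A minimal sketch of that path, assuming the data has already been laid out under dt=... subdirectories:
-- with data at s3://test.com/dt=2014-03-05/..., discover every partition in one go
MSCK REPAIR TABLE test;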
If you can't or don't want to use this naming convention, you will need to add all partitions as in:
ALTER TABLE test
ADD PARTITION (dt='2014-03-05')
LOCATION 's3://test.com/2014-03-05';
If you have an existing directory structure that doesn't comply with <partition name>=<partition value>, you have to add the partitions manually. MSCK REPAIR TABLE won't work unless you structure your directories that way.
After you specify a location at table creation, like:
CREATE EXTERNAL TABLE test (
foo string,
time string,
bar string
)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://test.com/';
you can add a partition without specifying the full path:
ALTER TABLE test ADD PARTITION (dt='2014-03-05') LOCATION '2014-03-05';
Although I've never checked it, I suggest you move your partitions into a folder inside the bucket rather than directly into the bucket itself, e.g. from s3://test.com/ to s3://test.com/data/.
If you are going to partition using a date field, you need an S3 folder structure as shown below:
s3://test.com/date=2014-03-05/ip-foo-request-2014-03-05_04-20_00-49.log
In that case you can create an external table with a partition column named date
and run MSCK REPAIR TABLE EXTERNAL_TABLE_NAME to update the Hive metastore.
Please look at the response posted above by Carter Shanklin. You need to make sure your files are stored in a partitionkey=partitionvalue directory structure, i.e. Hive by default expects partitions to be in subdirectories named following that convention.
In your example it should be stored as
s3://test.com/date=20140305/ip-foo-request-2014-03-05_04-20_00-49.log.
Steps to be followed:
i) Make sure data exists in the above structure
ii) Create the external table
iii) Now run MSCK REPAIR TABLE.
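A minimal sketch of steps ii) and iii), assuming the date=... layout above and reusing the columns from the question (the date column is backquoted because it is also a keyword in newer Hive versions):
-- ii) external table whose partition column name matches the date=... directories
CREATE EXTERNAL TABLE test (
foo string,
time string,
bar string
) PARTITIONED BY (`date` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://test.com/';
-- iii) register every date=... directory as a partition in the metastore
MSCK REPAIR TABLE test;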
I think the data is present in the S3 location but might not be updated in the metadata (EMRFS). For this to work, first do an emrfs import and emrfs sync,
and then apply the MSCK repair.
It will add all the partitions that are present in S3.

How to Delete a 000000 file in S3 bucket in AWS using a hive script

I've created a working Hive script to back up data from DynamoDB to a file in an S3 bucket in AWS. A code snippet is shown below:
INSERT OVERWRITE DIRECTORY '${hiveconf:S3Location}'
SELECT *
FROM DynamoDBDataBackup;
When I run the Hive script it presumably deletes the old file and creates a new file, but if there are errors in the backup process I guess it rolls back to the old data, because the file is still there after an error has occurred.
Each day we want to make a backup, but I need to know if an error has occurred, so I want to delete the previous day's backup first and then create a new backup. If it fails, there is no file in the folder, which we can detect automatically.
The file automatically gets named 000000.
In my Hive script I've tried, unsuccessfully:
delete FILE '${hiveconf:S3Location}/000000'
and
delete FILE '${hiveconf:S3Location}/000000.0'
Perhaps the filename is wrong. I haven't set any permissions on the file.
I've just tried this, but it fails at STORED:
SET dynamodb.endpoint= ${DYNAMODBENDPOINT};
SET DynamoDBTableName = "${DYNAMODBTABLE}";
SET S3Location = ${LOCATION};
DROP TABLE IF EXISTS DynamoDBDataBackupPreferenceStore;
CREATE TABLE IF NOT EXISTS DynamoDBDataBackupPreferenceStore(UserGuid STRING,PreferenceKey STRING,DateCreated STRING,DateEmailGenerated STRING,DateLastUpdated STRING,ReceiveEmail STRING,HomePage STRING,EmailFormat STRING,SavedSearchCriteria STRING,SavedSearchLabel STRING),
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
LOCATION '${hiveconf:S3Location}',
TBLPROPERTIES ("dynamodb.table.name" = ${hiveconf:DynamoDBTableName}, "dynamodb.column.mapping" = "UserGuid:UserGuid,PreferenceKey:PreferenceKey,DateCreated:DateCreated,DateEmailGenerated:DateEmailGenerated,DateLastUpdated:DateLastUpdated,ReceiveEmail:ReceiveEmail,HomePage:HomePage,EmailFormat:EmailFormat,SavedSearchCriteria:SavedSearchCriteria,SavedSearchLabel:SavedSearchLabel");
You can manage files directly using Hive table commands.
Firstly, if you want to use external data controlled outside Hive, use the EXTERNAL keyword when creating the table:
set S3Path='s3://Bucket/directory/';
CREATE EXTERNAL TABLE IF NOT EXISTS S3table
( data STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION ${hiveconf:S3Path};
You can now insert data into this table
INSERT OVERWRITE TABLE S3table
SELECT data
FROM DynamoDBtable;
This will create text files in S3 inside the directory location.
Note that, depending on the data size and number of reducers, there may be multiple text files.
File names also contain a random GUID element, e.g. 03d3842f-7290-4a75-9c22-5cdb8cdd201b_000000
DROP TABLE S3table;
Dropping the table just breaks the link to the files.
Now, if you want to manage the directory, you can create a table that will take control of the S3 directory (note there is no EXTERNAL keyword):
CREATE TABLE IF NOT EXISTS S3table
( data STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION ${hiveconf:S3Path};
If you now issue a DROP TABLE command, all files in the folder are deleted immediately:
DROP TABLE S3table;
I suggest you create a non-external table, then drop it and carry on with the rest of your script. If you encounter errors, you will have a blank directory after the job finishes.
Hope this covers what you need.
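Putting those pieces together, a hedged sketch of the clean-then-backup flow described above (table and variable names follow the earlier snippets; DynamoDBtable stands in for your DynamoDB-backed table):
set S3Path='s3://Bucket/directory/';
-- a managed (non-external) table takes control of the S3 directory
CREATE TABLE IF NOT EXISTS S3table
( data STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION ${hiveconf:S3Path};
-- dropping it wipes the previous day's backup files from the directory
DROP TABLE S3table;
-- recreate as external and write today's backup
CREATE EXTERNAL TABLE IF NOT EXISTS S3table
( data STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION ${hiveconf:S3Path};
INSERT OVERWRITE TABLE S3table
SELECT data
FROM DynamoDBtable;
If the INSERT fails, the directory is already empty, which is exactly the signal the backup job needs in order to detect an error.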

Is it possible to import data into Hive table without copying the data

I have log files stored as text in HDFS. When I load the log files into a Hive table, all the files are copied.
Can I avoid having all my text data stored twice?
EDIT: I load it via the following command
LOAD DATA INPATH '/user/logs/mylogfile' INTO TABLE `sandbox.test` PARTITION (day='20130221')
Then, I can find the exact same file in:
/user/hive/warehouse/sandbox.db/test/day=20130220
I assumed it was copied.
use an external table:
CREATE EXTERNAL TABLE sandbox.test(id BIGINT, name STRING) ROW FORMAT
DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/logs/';
If you want to use partitioning with an external table, you will be responsible for managing the partition directories (a sketch follows below).
The location specified must be an HDFS directory.
If you drop an external table, Hive WILL NOT delete the source data.
If you want to manage your raw files, use external tables. If you want Hive to do it, then let Hive store the data inside its warehouse path.
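For the partitioned case from the question, a hedged sketch of managing a partition yourself; it assumes sandbox.test was declared with PARTITIONED BY (day STRING) and that each day's logs sit in their own directory (the path here is illustrative):
-- register the existing log directory as a partition; nothing is moved or copied
ALTER TABLE sandbox.test ADD PARTITION (day='20130221')
LOCATION '/user/logs/20130221/';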
I can say: instead of copying data to HDFS directly from your Java application, keep those files in the local file system and import them into HDFS via Hive using the following command.
LOAD DATA LOCAL INPATH '/your/local/filesystem/file.csv' INTO TABLE `sandbox.test` PARTITION (day='20130221')
Notice the LOCAL keyword.
You can use an ALTER TABLE ... ADD PARTITION statement to avoid data duplication.
create External table if not exists TestTable (testcol string) PARTITIONED BY (year INT,month INT,day INT) row format delimited fields terminated by ',';
ALTER table TestTable partition (year='2014',month='2',day='17') location 'hdfs://localhost:8020/data/2014/2/17/';
Hive (at least when running in true cluster mode) cannot refer to external files in the local file system. Hive can automatically import the files during table creation or a load operation. The reason behind this is that Hive runs MapReduce jobs internally to extract the data. MapReduce reads from HDFS as well as writes back to HDFS, and runs in distributed mode. So if a file is stored in the local file system, it cannot be used by the distributed infrastructure.
