Hive: Table creation with multiple files in multiple directories - hadoop

I want to create a Hive table where the input text files are spread across multiple sub-directories in HDFS. For example, I have in HDFS:
/testdata/user/Jan/part-0001
/testdata/user/Feb/part-0001
/testdata/user/Mar/part-0001
and so on...
If I want to create a table user in Hive, but have it traverse the sub-directories of user, can that be done? I tried something like this, but it doesn't work:
CREATE EXTERNAL TABLE users (id int, name string)
STORED AS TEXTFILE LOCATION '/testdata/user/*'
I thought adding the wildcard would work, but it doesn't. Without the wildcard it still does not work. However, if I copy the files into the root directory of user, then it works. Is there no way for Hive to traverse into the child directories and grab those files?

You can create an external table, then add subfolders as partitions.
CREATE EXTERNAL TABLE test (id BIGINT) PARTITIONED BY ( yymmdd STRING);
ALTER TABLE test ADD PARTITION (yymmdd = '20120921') LOCATION 'loc1';
ALTER TABLE test ADD PARTITION (yymmdd = '20120922') LOCATION 'loc2';
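Applied to the directory layout from the question, that would look something like the sketch below (the column list and month directories are taken from the question; the partition column name month is an assumption):
CREATE EXTERNAL TABLE users (id int, name string)
PARTITIONED BY (month STRING)
STORED AS TEXTFILE
LOCATION '/testdata/user/';
ALTER TABLE users ADD PARTITION (month = 'Jan') LOCATION '/testdata/user/Jan';
ALTER TABLE users ADD PARTITION (month = 'Feb') LOCATION '/testdata/user/Feb';
ALTER TABLE users ADD PARTITION (month = 'Mar') LOCATION '/testdata/user/Mar';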

I ended up using a shell script like the one below for a use case where the sub-directories are not known a priori.
#!/bin/bash
# Create the partitioned external table once.
hive -e "CREATE EXTERNAL TABLE users (id int, name string) PARTITIONED BY (month string) STORED AS TEXTFILE LOCATION '/testdata/user/';"
# Build one ALTER TABLE statement per sub-directory found under /testdata/user/.
hscript=""
for part in $(hadoop fs -ls /testdata/user/ | grep -v -P "^Found" | grep -o -P "[a-zA-Z]{3}$")
do
    echo "$part"
    # Point each partition at its sub-directory explicitly, since the directories
    # are named Jan, Feb, ... rather than month=Jan, month=Feb, ...
    tmp="ALTER TABLE users ADD PARTITION(month='$part') LOCATION '/testdata/user/$part';"
    hscript=$hscript$tmp
done
hive -e "$hscript"

Hive uses subdirectories as partitions of the data, so simply:
CREATE EXTERNAL TABLE users (id int, name string) PARTITIONED BY (month string)
STORED AS TEXTFILE LOCATION '/testdata/user/'
That should do it for you.

CREATE EXTERNAL TABLE users (id int, name string);
LOAD DATA INPATH '/testdata/user/*/*' INTO TABLE users;

Don't put * after /testdata/user/ in the path; Hive will pick up all the sub-directories automatically.
If you want partitions, lay out the HDFS folders like /testdata/user/year=dynamicyear/month=dynamicmonth/date=dynamicdate.
After creating the partitioned table, run msck repair table tablename (a sketch follows the statement below).
CREATE EXTERNAL TABLE users (id int, name string)
STORED AS TEXTFILE LOCATION '/testdata/user/'
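A sketch of the partitioned variant, assuming the year=/month=/date= directory layout described above (the table name users_by_date is hypothetical, and date is backtick-quoted because it is a keyword in newer Hive versions):
CREATE EXTERNAL TABLE users_by_date (id int, name string)
PARTITIONED BY (year STRING, month STRING, `date` STRING)
STORED AS TEXTFILE LOCATION '/testdata/user/';
-- discover the partition directories that already exist under the table location
MSCK REPAIR TABLE users_by_date;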

Related

Timestamp partitioning in Hive

I am trying to create a timestamp-based partition in Hive, but Hive is creating a date-based partition. Below is my code. Could someone please help?
cat test1.sh
dat=`date +'%Y%m%d %H:%M:%S'`
hive -f load.hql -hiveconf file_load_timestamp=$dat;
cat load.hql
INSERT OVERWRITE table perm.test partition(file_load_timestamp='${hiveconf:file_load_timestamp}')
SELECT a,b FROM work.temp;
dt=20180102/ is the HDFS path that gets created.
dt=20180102 103455/ is the HDFS path I expect to be created.
When I tried the %Y%m%d_%H:%M:%S format it works as expected, but I need a space between the date and the time.
To create a folder name in HDFS with a space in it, you need to escape the space with \:
hadoop fs -mkdir test\ 123
creates a folder in HDFS named test 123.
Similarly, Hive keeps partitions in folders named after the partition value. That's why providing the date format %Y%m%d\ %H%M%S helps create a folder with a space in it.
Below is tested and working:
INSERT OVERWRITE table person_details1 partition(datelocal='20180102\ 200128') select * from person_details;
datelocal is a STRING column.
Edit: I executed the code; below is a working version:
hduser@Amit:~$ cat test1.sh
#!/bin/sh
dat=`date +'%Y%m%d\ %H%M%S'`
hive -f load.hql -hiveconf datelocal="$dat";
hduser@Amit:~$ cat load.hql
INSERT OVERWRITE table amit.person_details1 partition(datelocal='${hiveconf:datelocal}') select * from amit.person_details;

How to load multiple files into a table in Hive?

There is a directory which contains multiple files yet to be analyzed, for example, file1, file2, file3.
I want to
load data inpath 'path/to/*' overwrite into table demo
instead of
load data inpath 'path/to/file1' overwrite into table demo
load data inpath 'path/to/file2' overwrite into table demo
load data inpath 'path/to/file3' overwrite into table demo.
However, it just doesn't work. Are there any easier ways to implement this?
1.
load data inpath is an HDFS metadata operation.
The only thing it does is move files from their current location to the table location.
And again, "moving" (unlike "copying") is a metadata operation, not a data operation (see the sketch after this list).
2.
If the OVERWRITE keyword is used then the contents of the target table
(or partition) will be deleted and replaced by the files referred to
by filepath; otherwise the files referred by filepath will be added to
the table.
Language Manual DML-Loading files into tables
3.
load data inpath 'path/to/file1' into table demo;
load data inpath 'path/to/file2' into table demo;
load data inpath 'path/to/file3' into table demo;
or
load data inpath 'path/to/file?' into table demo;
or
dfs -mv path/to/file? ...{path to demo}.../demo
or (from bash)
hdfs dfs -mv path/to/file? ...{path to demo}.../demo
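A minimal sketch of what the "move" in point 1 means, assuming demo is stored under the default warehouse directory (that path is an assumption, not taken from the question):
# before the load, the staged file is still in its original directory
hdfs dfs -ls path/to/
# run from Hive: LOAD DATA INPATH 'path/to/file1' INTO TABLE demo;
# afterwards the same file has simply been moved under the table's directory
hdfs dfs -ls /user/hive/warehouse/demo/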
Creating a Hive table with the directory path as the LOCATION will automatically read all the files in that location.
for example:
CREATE [EXTERNAL] TABLE db.tbl(
    column1 string,
    column2 int ...)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY (delimiter)
LINES TERMINATED BY '\n'
LOCATION '/path/to/'; -- DO NOT POINT TO A SPECIFIC FILE, POINT TO THE DIRECTORY
Hive will automatically parse all the data within the folder and "force feed" it into the table you created. As long as all files in that path are in the same format, you are good to go.
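For the directory from the question, a filled-in version might look like this (a sketch; the column names and the comma delimiter are assumptions, not given in the question):
CREATE EXTERNAL TABLE demo(
    column1 string,
    column2 int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION '/path/to/'; -- picks up file1, file2 and file3 in that directory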
1) Directory contains three files
-rw-r--r-- 1 hadoop supergroup 125 2017-05-15 17:53 /hallfolder/hall.csv
-rw-r--r-- 1 hadoop supergroup 125 2017-05-15 17:53 /hallfolder/hall1.csv
-rw-r--r-- 1 hadoop supergroup 125 2017-05-15 17:54 /hallfolder/hall2.csv
2) Enable this setting:
SET mapred.input.dir.recursive=true;
3) hive>
load data inpath '/hallfolder/*' into table alltable;
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
filepath can be:
a relative path, such as project/data1
an absolute path, such as /user/hive/project/data1
a full URI with scheme and (optionally) an authority, such as hdfs://namenode:9000/user/hive/project/data1
The target being loaded to can be a table or a partition. If the table is partitioned, then one must specify a specific partition of the table by specifying values for all of the partitioning columns.
filepath can refer to a file (in which case Hive will move the file into the table) or it can be a directory (in which case Hive will move all the files within that directory into the table). In either case, filepath addresses a set of files.
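Since filepath can be a directory, the simplest way to load file1, file2 and file3 from the question in a single statement is to point at the directory itself rather than at a wildcard:
load data inpath 'path/to/' overwrite into table demo;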

Creating hive table: no files matching path file... but the file exist in the path

I'm trying to create a Hive ORC table using a file stored in HDFS.
I have a "partsupp.tbl" file where each line has the format below:
1|25002|8076|993.49|ven ideas. quickly even packages print. pending multipliers must have to are fluff|
I create a hive table like this:
create table if not exists partsupp (PS_PARTKEY BIGINT,
PS_SUPPKEY BIGINT,
PS_AVAILQTY INT,
PS_SUPPLYCOST DOUBLE,
PS_COMMENT STRING)
STORED AS ORC TBLPROPERTIES ("orc.compress"="SNAPPY")
;
Now I'm trying to load the data from the .tbl file into the table like this:
LOAD DATA LOCAL INPATH '/tables/partsupp/partsupp.tbl' INTO TABLE partsupp
But I'm getting this error:
No files matching path file:/tables/partsupp/partsupp.tbl
But the file exists in HDFS...
LOCAL signifies that the file is present on the local file system. If LOCAL is omitted, Hive looks for the file in HDFS.
So in this case, use following query:
LOAD DATA INPATH '/tables/partsupp/partsupp.tbl' INTO TABLE partsupp

add date time from flat file name cloudera

I started an EC2 cluster on Amazon to install Cloudera... I got it installed and configured, and loaded some of the Wiki Page Views public snapshot into HDFS. The structure of the files is as follows:
projectcode, pagename, pageviews, bytes
The files are named like this, with the date and time embedded in the name:
pagecounts-20090430-230000.gz
When loading the data from HDFS into Impala, I do it like this:
CREATE EXTERNAL TABLE wikiPgvws
(
project_code varchar(100),
page_name varchar(1000),
page_views int,
page_bytes int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION '/user/hdfs';
One thing I missed is the date and time of each file. The directory
/user/hdfs
contains multiple pagecount files associated with different dates and times. How can one pull that information and store it in a column when loading into Impala?
I think the thing you are missing is the concept of partitions. If you define the table as partitioned, the data can be divided into different partitions based on the timestamp in the file name. I was able to work around it in Hive; I hope you can adapt whatever is needed for Impala, as the query syntax is the same.
For me, this problem was not solvable using Hive alone, so I mixed bash with Hive scripting, and it works fine for me. This is how I wrapped it up:
Create table wikiPgvws with a partition column
Create table wikiTmp with the same fields as wikiPgvws, but without partitions
For each file
i. Load data into wikiTmp
ii. grep the timestamp from the file name
iii. Use sed to replace placeholders in a predefined hql script file that loads the data into the actual table, then run it
Drop table wikiTmp & remove tmp.hql
The script is as follows :
#!/bin/bash
hive -e "CREATE EXTERNAL TABLE wikiPgvws(
project_code varchar(100),
page_name varchar(1000),
page_views int,
page_bytes int
)
PARTITIONED BY(dts STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE";
hive -e "CREATE TABLE wikiTmp(
project_code varchar(100),
page_name varchar(1000),
page_views int,
page_bytes int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE"
for fileName in $(hadoop fs -ls /user/hdfs/bounty/pagecounts-*.txt | grep -Po '(?<=\s)(/user.*$)')
do
    echo "currentFile :$fileName"
    # extract the yyyymmdd-hhmmss stamp from the file name (note: $fileName, not $filename)
    dst=$(echo "$fileName" | grep -oE '[0-9]{8}-[0-9]{6}')
    echo "currentStamp $dst"
    # substitute the placeholders in the predefined hql script, then run it
    sed "s!sourceFile!'$fileName'!" t.hql > tmp.hql
    sed -i "s!targetPartition!$dst!" tmp.hql
    hive -f tmp.hql
done
hive -e "DROP TABLE wikiTmp"
rm -f tmp.hql
The hql script consists of just two lines :
LOAD DATA INPATH sourceFile OVERWRITE INTO TABLE wikiTmp;
INSERT OVERWRITE TABLE wikiPgvws PARTITION (dts = 'targetPartition') SELECT w.* FROM wikiTmp w;
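For instance, for a hypothetical file /user/hdfs/bounty/pagecounts-20090430-230000.txt, the two sed substitutions above would turn that template into a tmp.hql like:
LOAD DATA INPATH '/user/hdfs/bounty/pagecounts-20090430-230000.txt' OVERWRITE INTO TABLE wikiTmp;
INSERT OVERWRITE TABLE wikiPgvws PARTITION (dts = '20090430-230000') SELECT w.* FROM wikiTmp w;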
Epilogue:
Check whether options equivalent to hive -e and hive -f are available in Impala; without them this script is of no use to you. Also, the grep commands that fetch the file name and timestamp need to be adjusted to your table location and stamp pattern. This is just one way to show how the job can be done; I couldn't find another.
Enhancement
If everything works well, consider merging the first two DDLs into another script to make it look cleaner. Although I'm not sure whether hql script arguments can be used to define partition values, you could look into them as a replacement for sed (a sketch follows).
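A possible sketch of that idea, assuming Hive variable substitution (--hivevar) can stand in for the sed step (untested here):
# t.hql would then read:
#   LOAD DATA INPATH '${hivevar:sourceFile}' OVERWRITE INTO TABLE wikiTmp;
#   INSERT OVERWRITE TABLE wikiPgvws PARTITION (dts = '${hivevar:targetPartition}') SELECT w.* FROM wikiTmp w;
# and the loop body would call it directly, without sed:
hive --hivevar sourceFile="$fileName" --hivevar targetPartition="$dst" -f t.hql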

Checking table existence and loading data into HBase and Hive tables

I have data in HDFS, and I want to load that data into an HBase table and a Hive table.
I have written a bash shell script that contains a Pig script to load the data from HDFS into HBase, and also a Hive script to load the data from HDFS into a Hive table; both are working perfectly fine. My HDFS data files all have the same structure, and I'm loading all of them into a single HBase table and a single Hive table.
Now my question is: suppose I receive some more data files in the HDFS directory. If I run the shell script again, it will try to create the HBase and Hive tables again with the same names and report that the tables already exist. How can I write the Hive and HBase queries so that they first check for the table's existence: if the table does not exist, create it and load the data from HDFS into the HBase and Hive tables; if the table already exists, just insert the data into the existing tables without overwriting what is already there?
How can this be done?
Below is my script file: myScript.sh
echo "create 'goodtable','gt'" | hbase shell
pig -f a.pig -param input=/user/user/d/
hive -f h.hql
Where a.pig :
G = LOAD '$input' USING PigStorage(',') as (c1:chararray, c2:chararray,c3:chararray,c4:chararray,c5:chararray);
STORE G INTO 'hbase://goodtable' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('gt:name gt:state gt:phone_no gt:gender');
h.hql:
create external table hive_table(
id int,
name string,
state string,
phone_no int,
gender string) row format delimited fields terminated by ',' stored as textfile;
LOAD DATA INPATH '/user/user/d/' INTO TABLE hive_table;
I just wanted to add an example for HBase as Hive was already covered before:
if [[ $(echo "exists 'goodtable'" | hbase shell | grep 'not exist') ]];
then
echo "create 'goodtable','gt'" | hbase shell;
fi
For Hive, you can add IF NOT EXISTS to the CREATE TABLE statement (example below). See the documentation
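Applied to the h.hql from the question, that would look like this (the same DDL as above, with only IF NOT EXISTS added):
create external table if not exists hive_table(
id int,
name string,
state string,
phone_no int,
gender string) row format delimited fields terminated by ',' stored as textfile;
LOAD DATA INPATH '/user/user/d/' INTO TABLE hive_table;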
I don't have much experience with HBase, but I believe you can use the exists command to check whether the table exists and then create the table if it doesn't exist. See here
@visakh is correct - you can see if a table exists in HBase by entering the HBase shell and typing: exists '<tablename>'
In order to do this without entering the HBase shell interactively, you can create a simple ruby script such as the following:
exists 'mytable'
exit
Let's say you save this to a file called tabletest.rb. You can then execute this script by calling hbase shell tabletest.rb. This will create the following output, which you can then parse from your shell script:
Table tableisthere does exist
0 row(s) in 0.9830 seconds
OR
Table tableisNOTthere does not exist
0 row(s) in 0.9830 seconds
Adding more details for 'all in one' script:
Alternatively, you can create a more advanced script in ruby that checks for table existence and creates it if needed - this is done by calling the HBaseAdmin Java API from within the ruby script.
conf = HBaseConfiguration.new
hbaseAdmin = HBaseAdmin.new(conf)
if !hbaseAdmin.tableExists('mytable')
  hbaseAdmin.createTable('mytable',...)
end
