Creating a Hive table using a configuration file - hadoop

I know the basic concepts of Hive. My question is about creating a Hive table from an external configuration/schema file.
I know the basic query for creating a Hive table, where we pass the column headers and datatypes in the CREATE TABLE statement. In other words, we hard-code them.
But I want to create the Hive table so that it takes the column headers and datatypes from an external configuration file. Can this be done in Hive? It's fine even if we have to write a Unix shell script to achieve it, but I'm not sure how.
Below is the format of my configuration file :
Config.txt
id,Integer(2),NOT NULL
name,String(20)
state,String(5),NOT NULL
phone_no,Integer(4)
gender,Char(1)
As of now I have created one .hql file where I have written the Hive CREATE TABLE statement, and I call the .hql file from a bash script.
Below are the .hql file and .sh file:
hiveQ.hql:
create table goodrecs(
id int,
name string,
state string,
phone_no int,
gender string) row format delimited fields terminated by ',' stored as textfile;
LOAD DATA INPATH '/user/hduser/Dataparse/goodrec' INTO TABLE goodrecs;
testscript.sh:
#!/bin/bash
hive -f hiveQ.hql
In hiveQ.hql I want the column headers and datatypes to come from config.txt.
How can this be done?
Thanks in advance

It is fairly straightforward to convert config.txt into a standard HQL file: use a map that turns the types in config.txt into Hive column types, e.g. Integer to int and Char to string.
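For example, a minimal bash sketch of that idea, assuming config.txt has the comma-separated format shown above; the generated file name hiveQ_generated.hql and the exact type map are assumptions:

#!/bin/bash
# generate_hql.sh - build the CREATE TABLE statement from config.txt (illustrative sketch)

CONFIG=config.txt
OUT=hiveQ_generated.hql

map_type() {
  # Map the config file's types (Integer(2), String(20), Char(1), ...) to Hive types.
  case "$1" in
    Integer*) echo "int" ;;
    String*)  echo "string" ;;
    Char*)    echo "string" ;;
    *)        echo "string" ;;   # fallback for unknown types
  esac
}

cols=""
while IFS=',' read -r name type constraint || [ -n "$name" ]; do
  [ -z "$name" ] && continue
  cols="${cols}${cols:+,
}  ${name} $(map_type "$type")"   # the NOT NULL constraint column is ignored; Hive does not enforce it here
done < "$CONFIG"

cat > "$OUT" <<EOF
create table goodrecs(
${cols}
) row format delimited fields terminated by ',' stored as textfile;
LOAD DATA INPATH '/user/hduser/Dataparse/goodrec' INTO TABLE goodrecs;
EOF

hive -f "$OUT"

With the config.txt above, the generated statement matches the hand-written hiveQ.hql, except that the columns now come from the file.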

Related

No rows selected when trying to load csv file in hdfs to a hive table

I have a csv file called test.csv in HDFS. The file was placed there through FileZilla. I am able to view the path as well as the contents of the file when I log in to the edge node through PuTTY using the same account credentials that I used to place the file into HDFS. I then connect to Hive and try to create an external table specifying the location of my csv file in HDFS using the statement below:
CREATE EXTERNAL TABLE(col1 string, col2 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC LOCATION '/file path'
When I execute this command, it creates an external table in Hive, but the table is empty, with only the columns I mentioned in the create statement showing up. My question is: am I specifying the correct path in the LOCATION parameter of the create statement above? I tried using the path I see in FileZilla when I placed my csv file into HDFS, which is in the format home/servername/username/directory/subdirectory/file
but this returns an error saying that the user whose username is specified in the path above does not have ALL privileges on the file path.
NOTE: I checked the permissions on the file and the directory in which it resides, and the user has all permissions (read, write and execute).
I then tried changing the path to the format user/username/directory/subdirectory/file, and when I did this I was able to create the external table; however, the table is empty and does not load the data in the csv file on which it was created.
I also tried the alternative method of creating an internal table as below and then using the LOAD DATA INPATH command. But this also failed, as I get an error saying that "there are no files existing at the specified path".
CREATE TABLE foobar(key string, stats map<string, bigint>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':' ;
LOAD DATA INPATH '/tmp/foobar.csv' INTO TABLE foobar;
First of all, you can't load a csv file directly into a Hive table that was declared with the ORC file format. ORC is an optimized, compressed columnar storage format, so the text data has to be converted into it. You can load your data into an ORC-format table by following the steps below.
Create a temp table stored as text file format.
Load data into it using the command:
hive> load data inpath .....
or else you can use the LOCATION parameter while creating the table itself.
Now create a Hive table in your required file format (RC, ORC, Parquet, etc.).
Now load data into it using the following command:
hive> insert overwrite table foobar select * from temptbl;
You will get the table in ORC file format.
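For illustration, a minimal HQL sketch of those steps, reusing the foobar columns and the /tmp/foobar.csv path from the question; the temp table name temptbl is a placeholder:

-- step 1: temp table stored as plain text, matching the csv layout
CREATE TABLE temptbl(key string, stats map<string, bigint>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':'
STORED AS TEXTFILE;

-- step 2: load the csv into the text table
LOAD DATA INPATH '/tmp/foobar.csv' INTO TABLE temptbl;

-- step 3: target table in the required format (ORC here)
CREATE TABLE foobar(key string, stats map<string, bigint>)
STORED AS ORC;

-- step 4: convert by selecting from the text table into the ORC table
INSERT OVERWRITE TABLE foobar SELECT * FROM temptbl;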
On your second issue: when you load data into a table using the LOAD DATA INPATH command, the file is moved rather than copied, so the original path will become empty; a new directory is created under the default warehouse location (/user/hive/warehouse/) with the table name, and the data is moved into it. Check that location and you will see the data.

Using bash to send hive script a variable number of fields

I'm automating a data pipeline by using a bash script to move csvs to HDFS and build external Hive tables on them. Currently, this only works when the format of the table is predefined in an .hql file. But I want to be able to read the headers from the CSV and send them as arguments to Hive. So currently I do this inside a loop through the files:
# bash
hive -S -hiveconf VAR1=$target_db -hiveconf VAR2=$filename -hiveconf VAR3=$target_folder/$filename -f create_tables.hql
Which is sent to this...
-- hive
CREATE DATABASE IF NOT EXISTS ${hiveconf:VAR1};
CREATE EXTERNAL TABLE IF NOT EXISTS ${hiveconf:VAR1}.${hiveconf:VAR2}(
individual_pkey INT,
response CHAR(1))
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/${hiveconf:VAR3}';
I want the hive script to look more like this...
CREATE DATABASE IF NOT EXISTS ${hiveconf:VAR1};
CREATE EXTERNAL TABLE IF NOT EXISTS ${hiveconf:VAR1}.${hiveconf:VAR2}(
${hiveconf:ROW1} ${hiveconf:TYPE1},
... ...
${hiveconf:ROW_N} ${hiveconf:TYPE_N})
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/${hiveconf:VAR3}';
Is it possible to send it some kind of array that it would parse? Is this feasible or advisable?
I eventually figured out a way around this.
You can't really write an HQL script that takes a variable number of fields. You can, however, write a bash script that generates an HQL script of variable length. I've implemented this for my team, but the general idea is to write out how you want the HQL to look as a string in bash, then use something like Rscript to read in and identify the data types of your CSV. Store the data types in an array along with the CSV headers, and then loop through those arrays, writing the information into the HQL.
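A rough bash sketch of that generation step, assuming the CSV's first line holds the headers and, for simplicity, typing every column as STRING instead of inferring types with Rscript; the argument names and generated file name are placeholders:

#!/bin/bash
# build_table.sh - generate a CREATE EXTERNAL TABLE statement from a CSV header (illustrative sketch)

target_db=$1        # database name
filename=$2         # csv file name without extension, also used as the table name
target_folder=$3    # HDFS folder that holds the csv

# first line of the csv = column headers (strip a possible trailing carriage return)
header=$(head -n 1 "${filename}.csv" | tr -d '\r')

hql="CREATE DATABASE IF NOT EXISTS ${target_db};
CREATE EXTERNAL TABLE IF NOT EXISTS ${target_db}.${filename}("

sep=""
IFS=',' read -ra cols <<< "$header"
for col in "${cols[@]}"; do
  hql="${hql}${sep}
  ${col} STRING"
  sep=","
done

# skip.header.line.count keeps the header row from being read back as data
hql="${hql})
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/${target_folder}/${filename}'
TBLPROPERTIES ('skip.header.line.count'='1');"

echo "$hql" > create_tables_generated.hql
hive -S -f create_tables_generated.hql

Inferring real types per column (e.g. via Rscript, as described above) would only change the STRING literal inside the loop.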

hive: external partitioned table without location

Is it possible to create an external partitioned table without a location? I want to add all the locations later, together with the partitions.
I tried:
CREATE EXTERNAL TABLE IF NOT EXISTS a.b
(line STRING)
COMMENT 'abc'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\n'
STORED AS TEXTFILE
PARTITIONED BY day;
but I got: ParseException: missing EOF at 'PARTITIONED' near 'TEXTFILE'
I don't think so, as noted in alter location.
But anyway, I think your query has some errors, and the correct script would be:
CREATE EXTERNAL TABLE IF NOT EXISTS a.b
(line STRING)
COMMENT 'abc'
PARTITIONED BY (day String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\n'
STORED AS TEXTFILE
;
I think the issue is that you have not specified a data type for your partition column "day". You can create a Hive external table without a location and use ALTER TABLE options later to change the location.
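As an illustration of adding the locations later together with the partitions, a small HQL sketch building on the corrected CREATE statement above; the partition values and HDFS paths are placeholders:

-- create the external partitioned table without a LOCATION clause
CREATE EXTERNAL TABLE IF NOT EXISTS a.b
(line STRING)
COMMENT 'abc'
PARTITIONED BY (day STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\n'
STORED AS TEXTFILE;

-- later, register each partition together with its own location
ALTER TABLE a.b ADD PARTITION (day='2016-01-01') LOCATION '/data/b/2016-01-01';
ALTER TABLE a.b ADD PARTITION (day='2016-01-02') LOCATION '/data/b/2016-01-02';

-- the table-level location can also be changed afterwards if needed
ALTER TABLE a.b SET LOCATION '/data/b';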

In Oozie, how can I redirect the output of a query to a file?

In Oozie, I have used a Hive action in Hue, and I want to redirect the output of the query to a file. How can I generate that file?
My HQL is :
select * from emptable
where day>=${fromdate} and day<=${todate}
My HiveServer Action contains:
a. HQL script
b. Two parameter options, one for each date: fromdate = , todate =
c. Added file hive-site.xml.
My question is: how can I redirect the output of the query to a file?
You would need to execute a Shell action, which is not recommended; a better solution might be to do:
INSERT OVERWRITE DIRECTORY '/path' SELECT * FROM TABLE
Another option is to create an external table in Hive.
Example
CREATE EXTERNAL TABLE table_name(col type,col2 type) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' LOCATION '/path';
An EXTERNAL table points to any HDFS location for its storage, rather than being stored in the folder specified by the configuration property hive.metastore.warehouse.dir.
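Applied to the query in the question, a sketch of the INSERT OVERWRITE DIRECTORY approach; the output path is a placeholder, and the ROW FORMAT clause on directory writes needs Hive 0.11 or later:

-- write the query result to an HDFS directory as comma-separated text
INSERT OVERWRITE DIRECTORY '/user/output/emptable_extract'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM emptable
WHERE day >= ${fromdate} AND day <= ${todate};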

Hive table not retrieving rows from external file

I have a text file called sample.txt. The file looks like:
abc,23,M
def,25,F
efg,25,F
I am trying to create a table in Hive using:
CREATE EXTERNAL TABLE ppldb(name string, age int,gender string)
ROW FORMAT
DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/path/to/sample.txt';
But the data isn't getting into the table. When I run the query:
select count(*) from ppldb
I get 0 in output.
What could be the reason for data not getting loaded into the table?
The LOCATION of an external table in Hive should be an HDFS directory, not the full path of a file.
If that directory does not exist, the location we give will be created automatically. In your case /path/to/sample.txt is being treated as a directory.
So just give /path/to/ in the LOCATION and keep the sample.txt file inside that directory. It will work.
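A sketch of that fix, keeping the table definition from the question and assuming /path/to/ is the HDFS directory that holds sample.txt:

-- point the table at the directory, not at the file itself
CREATE EXTERNAL TABLE ppldb(name string, age int, gender string)
ROW FORMAT
DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/path/to/';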
Hope it helps...!!!
the LOCATION clause indicates where the table will be stored, not where to retrieve data from. After moving the samples.txt file into HDFS with something like
hdfs dfs -copyFromLocal ~/samples.txt /user/tables/
you could load the data into a table in hive with
create table temp(name string, age int, gender string)
row format delimited fields terminated by ','
stored as textfile;
load data inpath '/user/tables/samples.txt' into table temp;
That should work
