In Oozie, how can I redirect the output of a query to a file?

In Oozie, I have used a Hive action in Hue, and I want to redirect the output of the query to a file. How can I generate that file?
My HQL is:
select * from emptable
where day>=${fromdate} and day<=${todate}
My HiveServer action contains:
a. The HQL script
b. Two parameter options, one for each date: fromdate= and todate=
c. The added file hive-site.xml.
My question is: how can I redirect the output of the query to a file?

You would need to execute a Shell action, which is not recommended; a better solution might be to do an
INSERT OVERWRITE DIRECTORY '/path' SELECT * FROM TABLE
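For example, applied to the query from the question (a sketch; the output directory is an assumption, and Oozie substitutes the two date parameters):
-- hypothetical HDFS output directory; result files land here
INSERT OVERWRITE DIRECTORY '/user/me/emp_output'
SELECT * FROM emptable
WHERE day >= ${fromdate} AND day <= ${todate};
The results are written as one or more files under that directory, which a later workflow action can pick up.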

Another alternative is to create an external table in Hive. Example:
CREATE EXTERNAL TABLE table_name(col type,col2 type) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' LOCATION '/path';
An EXTERNAL table points to any HDFS location for its storage, rather than being stored in the folder specified by the configuration property hive.metastore.warehouse.dir.
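Writing the query results into that external table then places the files under '/path'. A sketch, reusing the question's query and assuming the external table's columns match emptable:
INSERT OVERWRITE TABLE table_name
SELECT * FROM emptable
WHERE day >= ${fromdate} AND day <= ${todate};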

Related

data deleted from hdfs after using hive load command [duplicate]

When loading data from HDFS to Hive, using the
LOAD DATA INPATH 'hdfs_file' INTO TABLE tablename;
command, it looks like it moves the hdfs_file to the hive/warehouse directory.
Is it possible (and how?) to copy it instead of moving it, so that the file can still be used by another process?
From your question I assume that you already have your data in HDFS, so you don't need to LOAD DATA, which moves the files to the default Hive location /user/hive/warehouse. You can simply define the table using the external keyword, which leaves the files in place but creates the table definition in the Hive metastore. See here:
Create Table DDL
e.g.:
create external table table_name (
id int,
myfields string
)
location '/my/location/in/hdfs';
Please note that the format you use might differ from the default (as mentioned by JigneshRawal in the comments). You can use your own delimiter, for example when using Sqoop:
row format delimited fields terminated by ','
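Putting the two together, a sketch of an external table over comma-delimited files (the column names are the illustrative ones from above):
create external table table_name (
id int,
myfields string
)
row format delimited fields terminated by ','
location '/my/location/in/hdfs';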
I found that when you use EXTERNAL TABLE and LOCATION together, Hive creates the table, and initially no data is present (assuming your data location is different from the Hive LOCATION).
When you use the LOAD DATA INPATH command, the data gets MOVED (instead of copied) from the data location to the location you specified while creating the Hive table.
If a location is not given when you create the Hive table, it uses the internal Hive warehouse location, and the data will get moved from your source data location to the internal Hive data warehouse location (i.e. /user/hive/warehouse/).
An alternative to LOAD DATA is available in which the data is not moved from your existing source location to the Hive data warehouse location.
You can use the ALTER TABLE command with the LOCATION option. Here is the required command:
ALTER TABLE table_name ADD PARTITION (date_col='2017-02-07') LOCATION 'hdfs/path/to/location/'
The only condition here is that the location should be a directory rather than a file.
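For a non-partitioned table, the related command is SET LOCATION (a sketch; note that this only repoints the table's metadata, it does not move or copy existing files):
ALTER TABLE table_name SET LOCATION 'hdfs/path/to/location/';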
Hope this will solve the problem.

No rows selected when trying to load csv file in hdfs to a hive table

I have a csv file called test.csv in HDFS. The file was placed there through FileZilla. I am able to view the path as well as the contents of the file when I log in to the Edge node through PuTTY, using the same account credentials that I used to place the file into HDFS. I then connect to Hive and try to create an external table specifying the location of my csv file in HDFS, using the statement below:
CREATE EXTERNAL TABLE(col1 string, col2 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC LOCATION '/file path'
When I execute this command, it creates an external table in Hive, but the table is empty, with only the columns from the create statement showing up. My question is: am I specifying the correct path in the LOCATION parameter of the create statement above? I tried using the path I see in FileZilla when I placed my csv file into HDFS, which is in the format home/servername/username/directory/subdirectory/file,
but this returns an error saying the user whose username is specified in the path above does not have ALL privileges on the file path.
NOTE: I checked the permissions on the file and the directory in which it resides, and the user has all permissions (read, write and execute).
I then tried changing the path to the format user/username/directory/subdirectory/file, and when I did this I was able to create the external table; however, the table is empty and does not load the data in the csv file on which it was created.
I also tried the alternative method of creating an internal table, as below, and then using the LOAD DATA INPATH command. But this also failed, as I am getting an error saying that "there are no files existing at the specified path".
CREATE TABLE foobar(key string, stats map<string, bigint>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':' ;
LOAD DATA INPATH '/tmp/foobar.csv' INTO TABLE foobar;
First, you can't load a csv file directly into a Hive table that was created with the ORC file format. ORC is an optimized columnar storage format, not plain text, so the data has to be converted. You can load your data into an ORC-format table by following the steps below.
Create a temp table with the text file format.
Load data into it using the command:
hive> LOAD DATA INPATH .....
or else you can use the LOCATION parameter while creating the table itself.
Now create a Hive table with your required file format (RC, ORC, Parquet, etc.).
Now load data into it using the following command:
hive> INSERT OVERWRITE TABLE foobar SELECT * FROM temptbl;
You will get the table in ORC file format.
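Putting those steps together (a sketch; the column list, table names, and path are illustrative):
-- 1. staging table stored as plain text
CREATE TABLE temptbl (col1 string, col2 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA INPATH '/user/username/directory/subdirectory/file' INTO TABLE temptbl;
-- 2. final table stored as ORC, filled from the staging table
CREATE TABLE final_tbl (col1 string, col2 string) STORED AS ORC;
INSERT OVERWRITE TABLE final_tbl SELECT * FROM temptbl;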
On the second issue: if you load data into the table using the LOAD DATA command, the file at your source path is moved, a new directory with the table name is created under the default location (/user/hive/warehouse/), and the data is moved into it. So check that location and you will see the data.
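You can verify this from the command line (a sketch, assuming the default warehouse location and the foobar table from above):
hadoop fs -ls /user/hive/warehouse/foobar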

how to preprocess the data and load into hive

I completed my Hadoop course and now I want to work on Hadoop. I want to know the workflow from data ingestion to visualizing the data.
I am aware of how the ecosystem components work, and I have built a Hadoop cluster with 8 datanodes and 1 namenode:
1 namenode -- ResourceManager, NameNode, SecondaryNameNode, Hive
8 datanodes -- DataNode, NodeManager
I want to know the following things:
I got data as .tar structured files, and the first 4 lines contain a description. I am a little confused about how to process this type of data.
1.a Can I directly process the data, as these are tar files? If yes, how do I remove the data in the first four lines? Should I untar the archive and remove the first 4 lines?
1.b I want to process this data using Hive.
Please suggest how to do that.
Thanks in advance.
Can I directly process the data as these are tar files?
Yes, see the solution below.
If yes, how do I remove the data in the first four lines?
Starting with Hive v0.13.0, there is a table property, tblproperties("skip.header.line.count"="1"), set while creating a table, to tell Hive the number of header rows to ignore. To ignore the first four lines: tblproperties("skip.header.line.count"="4")
CREATE TABLE raw (line STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
tblproperties("skip.header.line.count"="4");  -- on the source table, so reads skip the 4 header lines
CREATE TABLE raw_sequence (line STRING)
STORED AS SEQUENCEFILE;
LOAD DATA LOCAL INPATH '/tmp/test.tar' INTO TABLE raw;
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK; -- NONE/RECORD/BLOCK (see below)
INSERT OVERWRITE TABLE raw_sequence SELECT * FROM raw;
To view the data:
select * from raw_sequence
Reference: Compressed Data Storage
Follow the steps below to achieve your goal:
Copy the data (i.e. the .tar file) to the client system where Hadoop is installed.
Untar the file, manually remove the description, and save it locally (see the sketch after these steps).
Create the metadata (i.e. the table) in Hive based on the description.
E.g., if the description contains emp_id, emp_no, etc., then create the table in Hive using this information. Also make note of the field separator used in the data file, and use the corresponding field separator in the create table query. Assuming the file contains two columns separated by a comma, below is the syntax to create the table in Hive:
Create table tablename (emp_id int, emp_no int)
Row Format Delimited
Fields Terminated by ','
Since the data is in structured format, you can load it into the Hive table using the command below.
LOAD DATA LOCAL INPATH '/LOCALFILEPATH' INTO TABLE TABLENAME;
Now the local data will be moved to HDFS and loaded into the Hive table.
Finally, you can query the Hive table using SELECT * FROM TABLENAME;
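Steps 1 and 2 can be scripted; a minimal sketch, assuming a hypothetical archive data.tar containing one data file with a 4-line description header:
#!/bin/bash
# untar locally, then strip the first 4 description lines
tar -xf data.tar
tail -n +5 datafile > datafile_clean   # keep everything from line 5 onward
The cleaned file is then what LOAD DATA LOCAL INPATH points at.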

Creating hive table using configuration file

I know the basic concepts of Hive. My question is about creating a Hive table using an external configuration/schema file.
I know the basic query to create a Hive table, where we pass the column headers and datatypes in the create table statement. That is, we hard-code them.
But I want to create the Hive table so that it takes the column headers and datatypes from an external configuration file. Can this be done in Hive? It's fine even if we have to write a Unix shell script to achieve it, but I'm not sure how.
Below is the format of my configuration file :
Config.txt
id,Integer(2),NOT NULL
name,String(20)
state,String(5),NOT NULL
phone_no,Integer(4)
gender,Char(1)
As of now, I have created a .hql file where I have written the Hive create table script, and I call the .hql file from a bash script.
Below are the .hql file and .sh file:
hiveQ.hql:
create table goodrecs(
id int,
name string,
state string,
phone_no int,
gender string) row format delimited fields terminated by ',' stored as textfile;
LOAD DATA INPATH '/user/hduser/Dataparse/goodrec' INTO TABLE goodrecs;
testscript.sh:
#!/bin/bash
hive -f hiveQ.hql
In hiveQ.hql, I want the column headers and datatypes to come from the Config.txt file.
How can this be done?
Thanks in advance
It is convenient to convert Config.txt into a standard HQL file: use a mapping that turns the types in Config.txt into Hive column types, such as Integer to int and Char to string.
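A minimal sketch of that idea in bash, assuming the three-field format shown in Config.txt (name, Type(length), optional NOT NULL); the table name and the type map are assumptions:
#!/bin/bash
# generate_ddl.sh - build a CREATE TABLE statement from Config.txt
{
  echo "create table goodrecs ("
  awk -F',' '{
    type = tolower($2)
    sub(/\(.*/, "", type)            # drop the length, e.g. Integer(2) -> integer
    if (type == "integer") type = "int"
    if (type == "char")    type = "string"
    # the NOT NULL flag in field 3 is ignored: classic Hive does not enforce constraints
    printf "%s  %s %s", (NR > 1 ? ",\n" : ""), $1, type
  } END { print "" }' Config.txt
  echo ") row format delimited fields terminated by ',' stored as textfile;"
} > generated.hql
hive -f generated.hql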

Sequence File with Block Compression

I need to enable Sequence Files with block compression. Below is the table, which will be stored as a SequenceFile.
create table lip_data_quality
( buyer_id bigint,
total_chkout bigint,
total_errpds bigint
)
partitioned by (dt string)
row format delimited fields terminated by '\t'
stored as sequencefile
location '/apps/hdmi-technology/b_apdpds/lip-data-quality'
;
I am getting data into the above table in compressed form by enabling these settings:
set mapred.output.compress=true;
set mapred.output.compression.type=BLOCK;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec;
So my question is: is that all I need to enable BLOCK compression with Sequence Files, or is there anything else I need to do? I was following this article: Hadoop
Any suggestion will be appreciated.
Update:
I am loading the data into the above table by putting everything in a .hql file and running that .hql file from the shell command prompt, changing the partition date each time I run it.
set mapred.output.compress=true;
set mapred.output.compression.type=BLOCK;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec;
insert overwrite table lip_data_quality partition (dt='20120712')
SELECT query here which will give the output for the above table.
That should be fine then. You can also verify it by looking at the files on HDFS. Since the table declares an explicit LOCATION, there should be a directory named /apps/hdmi-technology/b_apdpds/lip-data-quality/dt=20120712 after your load. If you run
hadoop fs -cat
on one of the files in that folder, you should be able to see the header of the file, which will give you basic info on the file.
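A quick check along those lines (a sketch; the part file name, e.g. 000000_0, depends on the job that wrote it):
hadoop fs -ls /apps/hdmi-technology/b_apdpds/lip-data-quality/dt=20120712
hadoop fs -cat /apps/hdmi-technology/b_apdpds/lip-data-quality/dt=20120712/000000_0 | head -c 200
# a compressed sequence file starts with the bytes "SEQ" and names the
# key/value classes and the codec, e.g. org.apache.hadoop.io.compress.LzoCodec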
If you are writing the output from a MapReduce job rather than from Hive, set the properties below on the job configuration before submitting the job:
Configuration conf = job.getConfiguration();
conf.set("mapred.output.compress", "true");
conf.set("mapred.output.compression.type", "BLOCK");
conf.set("mapred.output.compression.codec", "org.apache.hadoop.io.compress.DefaultCodec");
Instead of DefaultCodec, one can use org.apache.hadoop.io.compress.LzoCodec.
