Unable to read date value from pig to hive - hadoop

I have my data processed using Pig and stored in an HDFS location (/tmp/output). This data now has to be read into a Hive table that points to the same location (/tmp/output). But when I try to get the date value from the Hive table, it shows NULL.
Below are the commands I used:
STORE DATA INTO '/tmp/output' USING PigStorage('\u0001');
When I fire the below query:
hive -e "select load_date from STUDENT"
It gives me NULL
2015-10-06T10:09:00.000-04:00 is the time format I see in /tmp/output.
It seems Hive is unable to read this format as a timestamp.
How can I convert this format into one Hive can read?
Any help will be greatly appreciated!

We can use HCatStorer to store the Pig output into a Hive table, but by default HCatStorer treats the input datatype as a string, so in the end the date column stored in the Hive table won't have a date datatype; it will be a string.
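A hedged workaround sketch: reformat the ISO 8601 value on the Pig side before storing, so the text Hive reads matches its default timestamp layout (yyyy-MM-dd HH:mm:ss). The relation and field names below are assumptions, not from the original post.
-- hypothetical relation/field names; ToDate parses the ISO 8601 string as-is
data_fmt = FOREACH data GENERATE ToString(ToDate(load_date), 'yyyy-MM-dd HH:mm:ss') AS load_date;
STORE data_fmt INTO '/tmp/output' USING PigStorage('\u0001');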

Related

How to insert the output of a pig script into hive external tables using a dynamically generated partition value?

I have written a Pig script that generates tuples for a Hive table. I am trying to dump the results to a specific partition in HDFS where Hive stores the table data. As of now, the partition value I am using is a timestamp string generated inside the Pig script. I have to use this timestamp string value to store my Pig script results, but I have no idea how to do that. Any help would be greatly appreciated.
If I understand it right, you read some data from a partition of a Hive table and want to store it into another Hive table's partition, right?
A Hive partition (from the HDFS perspective) is just a subfolder whose name is constructed like this: fieldname_the_partitioning_is_based_on=value
For example, if you have a date partition it looks like this: hdfs_to_your_hive_table/date=20160607/
So all you need is to specify this output location in the STORE statement:
STORE mydata INTO '$HIVE_DB.$TABLE' USING org.apache.hive.hcatalog.pig.HCatStorer('date=$today');
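If you prefer plain PigStorage over HCatStorer, here is a hedged sketch following the subfolder logic above; the paths and names are placeholders. Write straight into the partition's folder, then register it with the metastore:
STORE mydata INTO '/user/hive/warehouse/mydb.db/mytable/date=20160607' USING PigStorage('|');
-- then, in Hive, make the metastore aware of the new subfolder:
ALTER TABLE mytable ADD PARTITION (`date`='20160607');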

how to preprocess the data and load into hive

I completed my Hadoop course and now I want to work on Hadoop. I want to know the workflow from data ingestion to visualizing the data.
I am aware of how the ecosystem components work, and I have built a Hadoop cluster with 8 datanodes and 1 namenode:
1 namenode -- ResourceManager, NameNode, SecondaryNameNode, Hive
8 datanodes -- DataNode, NodeManager
I want to know the following things:
I got the data as .tar files of structured data, and the first 4 lines of each file contain a description. I am a little confused about how to process this type of data.
1.a Can I directly process the data, given these are tar files? If yes, how do I remove the data in the first four lines, or should I untar the files and remove the first 4 lines manually?
1.b I want to process this data using Hive.
Please suggest me how to do that.
Thanks in advance.
Can I directly process the data, given these are tar files?
Yes, see the below solution.
If yes, how do I remove the data in the first four lines?
Starting with Hive v0.13.0, there is a table property, tblproperties ("skip.header.line.count"="1"), set when creating a table, that tells Hive the number of rows to ignore. To ignore the first four lines: tblproperties ("skip.header.line.count"="4")
CREATE TABLE raw (line STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
TBLPROPERTIES ("skip.header.line.count"="4");  -- on the table the raw file is loaded into
CREATE TABLE raw_sequence (line STRING)
STORED AS SEQUENCEFILE;
LOAD DATA LOCAL INPATH '/tmp/test.tar' INTO TABLE raw;
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK; -- NONE/RECORD/BLOCK
INSERT OVERWRITE TABLE raw_sequence SELECT * FROM raw;
To view the data:
select * from raw_sequence;
Reference: Compressed Data Storage
Follow the below steps to achieve your goal:
Copy the data (i.e. the tar file) to the client system where Hadoop is installed.
Untar the file, manually remove the description lines, and save it locally.
Create the metadata (i.e. the table) in Hive based on the description.
E.g.: If the description contains emp_id, emp_no, etc., then create the table in Hive using this information. Also make note of the field separator used in the data file and use the corresponding field separator in the CREATE TABLE query. Assuming the file contains two columns separated by a comma, below is the syntax to create the table in Hive.
CREATE TABLE tablename (emp_id int, emp_no int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
Since the data is in a structured format, you can load it into the Hive table using the below command.
LOAD DATA LOCAL INPATH '/LOCALFILEPATH' INTO TABLE TABLENAME;
Now the local data will be moved to HDFS and loaded into the Hive table.
Finally, you can query the Hive table using SELECT * FROM TABLENAME;
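Tying the two answers together, a hedged end-to-end sketch; the table, column, and file names are assumptions. It uses the skip.header.line.count property from the first answer instead of removing the description lines by hand:
CREATE TABLE employee (emp_id int, emp_no int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES ("skip.header.line.count"="4");  -- skips the 4 description lines on read
LOAD DATA LOCAL INPATH '/home/user/emp_data.csv' INTO TABLE employee;  -- the untarred file
SELECT * FROM employee;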

storing pig output into Hive table in a single instance

I would like to insert the Pig output into Hive tables (the tables in Hive are already created with the exact schema). I just need to insert the output values into the table. I don't want the usual method, where I first store into a file, then read that file from Hive and then insert into tables. I need to reduce that extra hop.
Is it possible? If so, please tell me how this can be done.
Thanks
OK. Create an external Hive table with a schema layout somewhere in an HDFS directory. Let's say:
create external table emp_records(id int,
name String,
city String)
row format delimited
fields terminated by '|'
location '/user/cloudera/outputfiles/usecase1';
Just create a table like the above; there is no need to load any file into that directory.
Now write a Pig script that reads data from some input directory, and when you store the output of that Pig script, use it as below:
A = LOAD 'inputfile.txt' USING PigStorage(',') AS(id:int,name:chararray,city:chararray);
B = FILTER A by id >= 678933;
C = FOREACH B GENERATE id,name,city;
STORE C INTO '/user/cloudera/outputfiles/usecase1' USING PigStorage('|');
Ensure that the destination location, the delimiter, and the schema layout of the final FOREACH statement in your Pig script match the Hive DDL schema.
There are two approaches, explained below with an 'Employee' table example, to store Pig output into a Hive table. (The prerequisite is that the Hive table should already be created.)
A = LOAD 'EMPLOYEE.txt' USING PigStorage(',') AS(EMP_NUM:int,EMP_NAME:chararray,EMP_PHONE:int);
Approach 1: Using HCatalog
-- dump pig result to Hive using HCatalog
store A into 'Empdb.employee' using org.apache.hive.hcatalog.pig.HCatStorer();
(or)
Approach 2: Using the HDFS physical location
-- dump pig result to the external hive warehouse location
STORE A INTO 'hdfs://<<nmhost>>:<<port>>/user/hive/warehouse/Empdb/employee/' USING PigStorage(',');
You can store it using HCatalog:
STORE D INTO 'tablename' USING org.apache.hive.hcatalog.pig.HCatStorer();
See the below link:
https://acadgild.com/blog/loading-and-storing-hive-data-into-pig
The best way is to use HCatalog and write the data into the Hive table.
STORE final_data INTO 'Hive_table_name' using org.apache.hive.hcatalog.pig.HCatStorer();
But before storing the data, make sure the columns in the 'final_data' dataset are perfectly matched and mapped to the schema of the table.
And run your Pig script like this:
pig -useHCatalog script.pig
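To make the "matched and mapped" point concrete, a hypothetical illustration; the table, relation, and field names are assumed, and the casts mirror a Hive DDL of (id int, name string, city string):
-- Hive side, for reference: CREATE TABLE empdb.emp (id int, name string, city string);
final_data = FOREACH raw_data GENERATE (int)id, (chararray)name, (chararray)city;
STORE final_data INTO 'empdb.emp' USING org.apache.hive.hcatalog.pig.HCatStorer();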

Hive date format not supporting in impala

The Hive date format is not supported in Impala.
I created a partition on a date column in a Hive table, but when I access the same table from hive_metadata in Impala, it shows:
CAUSED BY: TableLoadingException: Failed to load metadata for table
'employee_part' because of unsupported partition-column type 'DATE' in
partition column 'hiredate'.
Please let me know which date format Hive and Impala commonly support.
I used the date format yyyy-mm-dd in Hive.
Impala doesn't support the Hive DATE type.
You have to use a TIMESTAMP (which means you will always carry a time component, but it will be 00:00:00.000). Then, depending on the tool you use afterwards, you unfortunately have to make a conversion again.
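A hedged workaround sketch, since TIMESTAMP is also rejected as a partition-column type by the Impala versions this answer targets: keep the partition column as a STRING in yyyy-MM-dd form, which still sorts and prunes chronologically. The table and column names are assumptions:
CREATE TABLE employee_part (emp_id int, emp_name string)
PARTITIONED BY (hiredate string);  -- 'yyyy-MM-dd' strings order chronologically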

HDFS String data to timestamp in hive table

Hi, I have data in HDFS as a string: '2015-03-26T00:00:00+00:00'. If I want to load this data into a Hive table (column as timestamp), I am not able to load it and I get NULL values.
If I specify the column as string, I get the data into the Hive table, but if I specify the column as timestamp, I am not able to load the data and I get all NULL values in that column.
E.g.: HDFS - '2015-03-26T00:00:00+00:00'
Hive table - create table t1(my_date string)
I can get the output as - '2015-03-26T00:00:00+00:00'
If I specify create table t1(my_date timestamp) - I see all NULL values.
Can anyone help me with this?
Timestamps in text files have to use the format yyyy-mm-dd hh:mm:ss[.f...]. If they are in another format declare them as the appropriate type (INT, FLOAT, STRING, etc.) and use a UDF to convert them to timestamps.
Go through the below link:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-Timestamps
You have to use a staging table. In the staging table, load the column as a String, and in the final table use a UDF as below to convert the string value to a Timestamp:
from_unixtime(unix_timestamp(column_name, 'dd-MM-yyyy HH:mm'))
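For the asker's actual value, a hedged sketch of that conversion; the staging table name is assumed, and the XXX offset pattern relies on the Java 7+ SimpleDateFormat that Hive uses under the hood:
-- my_date is declared as string in the hypothetical staging table t1_staging
SELECT CAST(from_unixtime(unix_timestamp(my_date, "yyyy-MM-dd'T'HH:mm:ssXXX")) AS timestamp)
FROM t1_staging;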
