Loading data from HDFS into Hive results in NULL output in the table - hadoop

I am trying to load data from HDFS into the Hive data warehouse using a Hive SerDe (serialization/deserialization) query, but retrieving from the table returns NULL output.
Can anyone please help me out?
hive>create table stations(usaf string, wban string, name string)
>row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
>with SERDEPROPERTIES(
>"input.regex" ="(\\d{6}) (\\d{5}) (.{29}) .*"
>);
hive> load data inpath '/user/cloudera/input-new/ncdc/metadata/stations-fixed-width.txt'
>into table stations;
While retrieving from the table:
hive>select * from stations limit 4;
Results:
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
Sample data looks like this:
010014 99999 SOERSTOKKEN NO NO ENSO +59783 +005350 +00500

I checked your regex - it's correct.
Just add output.format.string to the SERDEPROPERTIES as follows:
create table stations(usaf string, wban string, name string)
row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
with SERDEPROPERTIES(
"input.regex" ="(\\d{6}) (\\d{5}) (.{29}) .*",
"output.format.string" = "%1$s %2$s %3$s"
)
;
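As a side note, newer Hive releases also ship a built-in RegexSerDe (org.apache.hadoop.hive.serde2.RegexSerDe) that needs only input.regex and no output.format.string; a minimal sketch, assuming your Hive version includes it:
create table stations(usaf string, wban string, name string)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties(
"input.regex" = "(\\d{6}) (\\d{5}) (.{29}) .*"
);
Note that with either RegexSerDe, all table columns must be of type string.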

Related

How to get default values of table properties in Hive?

I created an internal table using HiveQL:
CREATE TABLE city (
id INT,
city VARCHAR(15)
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS TEXTFILE;
I inserted one record:
INSERT INTO city SELECT 1, null;
I want to know which default values are used, but Hive returns 'Table default.city does not have property' for each of the following:
SHOW TBLPROPERTIES city('serialization.format');
SHOW TBLPROPERTIES city('serialization.null.format');
SHOW TBLPROPERTIES city('serialization.encoding');
SHOW TBLPROPERTIES city('serialization.escape.crlf');
I also don't see them using the describe command:
DESCRIBE FORMATTED city;
I found out which values are used by analyzing the files on HDFS, but I want to know if there is an easy way to get the default values using HiveQL.
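Those defaults are hard-coded in LazySimpleSerDe rather than stored in the metastore, which is why SHOW TBLPROPERTIES has nothing to return for them. The file-level check you already did is the reliable way; a minimal sketch from the Hive CLI (the warehouse path and file name are assumptions for a default setup):
dfs -cat /user/hive/warehouse/city/000000_0;
-- expected raw record: 1^A\N
-- ^A (Ctrl-A, \001) is the default field delimiter, i.e. serialization.format = '1'
-- \N is the default serialization.null.format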

Hive Error: ORC does not support type conversion from DATE to TIMESTAMP

I have a source table in Hive with DDL as:
CREATE EXTERNAL TABLE JRNL.SOURCE_TAB(
ticket_id varchar(11),
ttr_start timestamp,
ttr_stop timestamp
)
PARTITIONED BY (
exp_dt string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://my.cluster.net:8020//db/data/SOURCE_TAB'
TBLPROPERTIES (
'last_modified_by'='edpintdatd',
'last_modified_time'='1466093031',
'serialization.null.format'='',
'transient_lastDdlTime'='1466093031')
When I query the table:
hive> select exp_dt from JRNL.SOURCE_TAB limit 3;
It is giving me an Exception:
Failed with exception java.io.IOException:java.io.IOException: ORC does not support type conversion from DATE to TIMESTAMP
Even when I tried to create a replica of the source table above, using:
CREATE TABLE JRNL.SOURCE_TAB_BKP(
ticket_id varchar(11),
ttr_start timestamp,
ttr_stop timestamp
)
PARTITIONED BY (exp_dt string);
and then to insert data into this table using:
INSERT INTO TABLE JRNL.SOURCE_TAB_BKP PARTITION (exp_dt)
SELECT
ticket_id,
ttr_start,
ttr_stop,
exp_dt string
FROM JRNL.SOURCE_TAB;
it still gives me the error ORC does not support type conversion from DATE to TIMESTAMP.
I tried using
to_utc_timestamp(unix_timestamp(ttr_start),'UTC'),
to_utc_timestamp(unix_timestamp(ttr_stop),'UTC'),
but this isn't helping either.
I have already set the hive.exec.dynamic.partition.mode=nonstrict.
I even used CAST(.... as DATE) and CAST(.... as TIMESTAMP). Neither worked.
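Since the exception is thrown while the ORC reader decodes the files, any CAST in the SELECT list runs too late: the files on disk apparently hold DATE values while the table metadata says TIMESTAMP, and ORC refuses that conversion. A hedged workaround is to declare a second external table whose column types match what is actually in the files and convert from there (SOURCE_TAB_DATE is a hypothetical name):
CREATE EXTERNAL TABLE JRNL.SOURCE_TAB_DATE(
ticket_id varchar(11),
ttr_start date,
ttr_stop date
)
PARTITIONED BY (exp_dt string)
STORED AS ORC
LOCATION 'hdfs://my.cluster.net:8020//db/data/SOURCE_TAB';

MSCK REPAIR TABLE JRNL.SOURCE_TAB_DATE;  -- register the existing partitions

INSERT INTO TABLE JRNL.SOURCE_TAB_BKP PARTITION (exp_dt)
SELECT ticket_id, CAST(ttr_start AS TIMESTAMP), CAST(ttr_stop AS TIMESTAMP), exp_dt
FROM JRNL.SOURCE_TAB_DATE;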

Inserting into Hive Table error

I am looking to encode columns of a table in Hive.
I tried:
hive> create table encode_test(id int, name STRING, phone STRING, address STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
> WITH SERDEPROPERTIES ('column.encode.columns'='phone,address', 'column.encode.classname'='org.apache.hadoop.hive.serde2.Base64WriteOnly') STORED AS TEXTFILE;
Say I have a CSV file with the following row:
100,'navis','010-0000-0000','Seoul Seocho'
Now I tried to use:
LOAD DATA LOCAL INPATH
'/home/path/to/csv/test.csv'
INTO TABLE encode_test;
But when doing SELECT * FROM encode_test I am getting all columns as NULL, whereas the result should have been:
100 navis MDEwLTAwMDAtMDAwMA== U2VvdWwsIFNlb2Nobw==
I also want to add FIELDS TERMINATED BY ',' to the create table encode_test query, but I am getting an error: EOF error near 'Fields'.
I also tried creating another table, sample:
create table sample(id int, name STRING, phone STRING, address STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
I then imported the CSV file into the sample table, and it was successfully imported. Then I tried:
insert into encode_test select * from sample;
But I am getting this new error:
Permission denied: user=root, access=WRITE, inode="/user":hdfs:supergroup:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkFsPermission(DefaultAuthorizationProvider.java:279)
I'm new to Hadoop. Please refer to this link, which is where I took this approach from.
In Hive DDL, ROW FORMAT SERDE and FIELDS TERMINATED BY cannot co-exist. Instead, you can use the field.delim SerDe property:
create table encode_test(id int, name STRING, phone STRING, address STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'=',',
'column.encode.columns'='phone,address',
'column.encode.classname'='org.apache.hadoop.hive.serde2.Base64WriteOnly')
STORED AS TEXTFILE;
As for the PermissionDenied exception, run the Hive queries as either the hdfs or the hive user, since the root user does not have WRITE access to HDFS.
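For reference, a common way to resolve that particular error on a Cloudera-style setup is to give root a home directory on HDFS (a sketch, assuming you can act as the hdfs superuser):
# run from a shell as a user allowed to sudo to hdfs
sudo -u hdfs hdfs dfs -mkdir -p /user/root
sudo -u hdfs hdfs dfs -chown root:root /user/root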

How to create an external Hive table with complex data types that points to an HBase table?

I have an HBase table with column families (Name, Contact) and columns Name(String), Age(String), workStreet(String), workCity(String), workState(String).
I want to create an external Hive table that points to this HBase table with the following columns:
Name(String), Age(String), Address(Struct).
CREATE EXTERNAL TABLE hiveTable(id INT,name STRING, age STRING,
address STRUCT<Street:STRING,City:STRING,State:STRING>)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" ="Name:name,Name:age,Contact:workStreet, Contact:workCity, Contact:workState")
TBLPROPERTIES("hbase.table.name" = "hbaseTable");
It ran into the following error:
Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException:
MetaException(message:org.apache.hadoop.hive.serde2.SerDeException org.apache.hadoop.hive.hbase.HBaseSerDe: columns has 3 elements while hbase.columns.mapping
has 5 elements (counting the key if implicit))
I have tried using a MAP instead of a STRUCT. Below is the query:
CREATE EXTERNAL TABLE hiveTable(id INT,name STRING,age STRING,
address MAP<String,String>)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "Name:name,Name:,Contact:")
TBLPROPERTIES("hbase.table.name" = "hbaseTable");
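The count mismatch in the first attempt comes from a rule of the HBase storage handler: hbase.columns.mapping must contain exactly one entry per Hive column (plus the implicit :key), so a STRUCT cannot be spread over several HBase columns. What the handler does support is mapping a whole column family to a Hive MAP; a sketch that fixes both the entry count and the types:
CREATE EXTERNAL TABLE hiveTable(id INT, name STRING, age STRING,
address MAP<STRING,STRING>)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,Name:name,Name:age,Contact:")
TBLPROPERTIES ("hbase.table.name" = "hbaseTable");
address then comes back as {'workStreet':..., 'workCity':..., 'workState':...}; if a STRUCT is really required, it can be rebuilt on top of this table with a view using named_struct.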

Hive text format with multi-line column to ORC

When a Hive table in text format with a multi-line column is converted to ORC format, the columns are not read correctly.
Hive table with a custom record delimiter:
CREATE EXTERNAL TABLE IF NOT EXISTS MULTILINE_XML_TXT
(id INT, name STRING, xml STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/practice/xml/multiline/mysql/text/in/'
TBLPROPERTIES ('textinputformat.record.delimiter'='#');
The xml column in the above table has data spanning multiple lines.
When I query this table, the data looks right.
Sample data (2 rows) in the above table:
100 xyz <employees><employee><age>26</age>
</employee><employee><age>45</age>
</employee></employees>
200 abc <employees><employee><age>20</age>
</employee><employee>
<age>50</age></employee></employees>
I created another table with the ORC format and copied data from the text table to the ORC table, but the conversion is not correct.
CREATE TABLE IF NOT EXISTS MULTILINE_XML_ORC
(id INT, name STRING, xml STRING) STORED AS ORC;
INSERT OVERWRITE TABLE MULTILINE_XML_ORC
SELECT id, name, xml FROM MULTILINE_XML_TXT;
Executing the query select * from MULTILINE_XML_ORC gives the following result, which is incorrect.
100 xyz <employees><employee><age>26</age>
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL abc <employees><employee><age>20</age>
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
Any thoughts?
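One thing worth checking, as a guess based on the symptoms: the custom textinputformat.record.delimiter from TBLPROPERTIES is honored when you SELECT from the text table directly, but depending on the Hive version and execution engine the job behind the INSERT may fall back to splitting records on '\n', which would produce exactly the extra NULL rows shown above. A hedged workaround is to set the delimiter at the session level before running the copy:
SET textinputformat.record.delimiter=#;
INSERT OVERWRITE TABLE MULTILINE_XML_ORC
SELECT id, name, xml FROM MULTILINE_XML_TXT;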
