How to make a Hive table match data using column names instead of ordinal positions

If I have a csv like -
colName1,colName2
col1Value,col2Value
and a hive ddl like -
CREATE EXTERNAL TABLE tableName (
col2 STRING,
col1 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'hdfs://location/to/testcsv/directory'
tblproperties ("skip.header.line.count"="1");
-- select col2 from tableName; gives col1Value
This is obviously because, for text files, Hive maps columns to data fields by ordinal position. If the underlying file is Parquet, the match is done by column name instead.
I was wondering whether there is a Hive SerDe someone has written, or perhaps a SerDe property I am missing, that tells Hive to map data field names to the Hive table's column names, so that in the above example querying col2 would return "col2Value" even though the ordinal position of col2 in the Hive table and in the data file does not match.
Thanks in advance!
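As far as I know, the built-in text SerDes map fields strictly by position, so there is no property that switches them to name-based matching. A minimal sketch of a common workaround (my own, not from the question; tableName_raw is a name I made up): declare the external table's columns in the same order as the file, then expose the preferred names and order through a view.
CREATE EXTERNAL TABLE tableName_raw (
  col1 STRING,   -- first field in the file (colName1)
  col2 STRING    -- second field in the file (colName2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'hdfs://location/to/testcsv/directory'
tblproperties ("skip.header.line.count"="1");
CREATE VIEW tableName AS SELECT col2, col1 FROM tableName_raw;
-- select col2 from tableName; now returns col2Value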

Related

HIVE - create external tables where string itself contains commas

I am new to Hive and am creating external tables on a csv file. One of the issues I am coming across is values that contain multiple commas within the string itself. For example, the csv file contains the following:
(screenshot: the CSV file)
When I create an external table in Hive, because there are commas within the "name" column, the first name shifts to the right, adding another column. This throws all of the data off when you view the table in Hive.
(screenshot: the external table result in Hive)
Is there anything I can add to my script to keep the commas but also keep first and last name in the same column when the external table is created? Thank you all in advance - I am very new to Hive.
CREATE EXTERNAL TABLE database.tablename (
ID INT,
Name String,
City String,
State String
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/xyz/xyz/database/directory/'
TBLPROPERTIES ("skip.header.line.count"="1");
Check this solution - you need to add this line: ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
https://community.cloudera.com/t5/Support-Questions/comma-in-between-data-of-csv-mapped-to-external-table-in/td-p/220193
Complete DDL example:
create table hcc(field1 string,
field2 string,
field3 string,
field4 string,
field5 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\"");

How to Copy TEXT format partitioned table to ORC format Table in Hive

I have a text-format Hive table, like:
CREATE EXTERNAL TABLE op_log (
time string, debug string,app_id string,app_version string, ...more fields)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
Now I create an ORC-format table with the same fields, like:
CREATE TABLE op_log_orc (
time string, debug string,app_id string,app_version string, ...more fields)
PARTITIONED BY (dt string)
STORED AS ORC tblproperties ("orc.compress" = "SNAPPY");
When I copy from op_log to op_log_orc, I get this error:
hive> insert into op_log_orc PARTITION(dt='2016-08-09') select * from op_log where dt='2016-08-09';
FAILED: SemanticException [Error 10044]: Line 1:12 Cannot insert into target table because column number/types are different ''2016-08-09'': Table insclause-0 has 62 columns, but query has 63 columns.
hive>
The partition key (dt) in the source table is returned in the result set as though it were a regular field, which is where the extra column comes from. Exclude the dt field from the select list (instead of using *) if you are going to specify its value in the PARTITION clause. Alternatively, specify dt in the PARTITION clause without providing a value and let it be filled from the query. See CTAS (create table as select ...) in the examples here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableAsSelect(CTAS)
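A minimal sketch of both variants (my own; the column list is abbreviated the same way as in the question, so the remaining columns would have to be written out in full for option 1):
-- Option 1: static partition value, so leave dt out of the select list
INSERT INTO TABLE op_log_orc PARTITION (dt='2016-08-09')
SELECT time, debug, app_id, app_version -- ...and the remaining columns, but NOT dt
FROM op_log WHERE dt='2016-08-09';
-- Option 2: dynamic partitioning; dt is taken from the last column of the select
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE op_log_orc PARTITION (dt)
SELECT * FROM op_log WHERE dt='2016-08-09';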

remove surrounding quotes from fields while loading data into hive

I want to load input data into a Hive table. The data is in the following format.
"153662";"0002241447";"0"
"153662";"000647036X";"0"
"153662";"0020434901";"0"
"153662";"0020973403";"0"
"153662";"0028604202";"0"
"153662";"0030437512";"0"
I want to load this data into a table with two varchar columns and one int column. But the surrounding double quotes trouble me. I have created the following table.
CREATE EXTERNAL TABLE Table(A varchar(50),B varchar(50),C varchar(50))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\;'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
but the quotes around the fields also become part of the field values, as shown below.
"276725" "034545104X" "0"
"276726" "0155061224" "5"
I want to ignore them. I also want the third field to be read as INT. Currently it becomes NULL when I declare the third field as INT while creating the table.
You will have to use the CSV SerDe (OpenCSVSerde) for this.
CREATE TABLE Table(A varchar(50),B varchar(50),C varchar(50))
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES
(
"separatorChar" = ";",
"quoteChar" = "\""
)
STORED AS TEXTFILE;
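One caveat (my addition, not part of the original answer): OpenCSVSerde reads every column as STRING no matter what type is declared, so the third field still needs a cast wherever an INT is required, for example:
SELECT A, B, CAST(C AS INT) AS C FROM Table;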
Multiple ways to achieve this:
Use CSV serde
Use regex serde - regex "\"(.*)\"\;\"(.*)\"\;\"(.*)\"" (a sketch follows after the example below)
Load data to external table then remove double quotes:
CREATE EXTERNAL TABLE source(
a string,
b String,
c String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;' LOCATION 'xyz';
CREATE TABLE destination AS SELECT REGEXP_REPLACE(a,'"',''), REGEXP_REPLACE(b,'"',''), CAST ( REGEXP_REPLACE(c,'"','') AS BIGINT) FROM source;
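For option 2, a minimal sketch of the regex SerDe variant (my own, assuming the built-in org.apache.hadoop.hive.serde2.RegexSerDe; each capture group maps to one column, and all columns must be strings):
CREATE EXTERNAL TABLE source_regex (
  a string,
  b string,
  c string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "\"(.*)\";\"(.*)\";\"(.*)\"")
LOCATION 'xyz';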
Hive query to remove double quotes around the string.
Example:
col2 value: "my name is, abc"
select col1, (regexp_replace(col2,'"','')) as col2 from table;
Output: my name is, abc

How to store date value in hive timestamp?

I am trying to store date and timestamp values in a timestamp column using Hive. The source file contains date values and sometimes timestamp values.
Is there a way to read both dates and timestamps using the timestamp data type in Hive?
Input:
2015-01-01
2015-10-10 12:00:00.232
2016-02-01
Output which I am getting:
null
2015-10-10 12:00:00.232
null
Is it possible to read both kinds of values using the timestamp data type?
DDL:
create external table mytime(id string ,t timestamp) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'hdfs://xxx/data/dev/ind/'
I was able to think of a workaround and tried it with a small set of data:
Load the data with the inconsistent date values into a Hive table, say table1, declaring the column as string.
Now create another table table2 with timestamp as the datatype for the required column, and load the data from table1 into table2 using the transformation INSERT OVERWRITE TABLE table2 select id, if(length(tsstr) > 10, tsstr, concat(tsstr, ' 00:00:00')) from table1;
This should load the data in the required format.
Code as below:
create table table1
(
id int,
tsstr string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION '/user/cloudera/hive/table1.tb';
Data:
1,2015-04-15 00:00:00
2,2015-04-16 00:00:00
3,2015-04-17
LOAD DATA LOCAL INPATH '/home/cloudera/data/tsstr' INTO TABLE table1;
create table table2
(
id int,
mytimestamp timestamp
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION '/user/cloudera/hive/table2.tb';
INSERT INTO TABLE table2 select id,if(length(tsstr) > 10, tsstr, concat(tsstr,' 00:00:00')) from table1;
Result shows up as expected.
Hive is similar to any other database in terms of datatype mapping, and hence requires uniform values for a specific column so they can be stored under one datatype. The data in your file for the second column is non-uniform, i.e. some values are in date format while others are in timestamp format.
In order not to lose the date, as suggested by @Kishore, make sure the file has a uniform format: produce the file with timestamp values such as 2016-01-01 00:00:00.000 where there are only dates.
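If regenerating the file is not an option, a variation on the workaround above (my own sketch; mytime_str is a hypothetical copy of the table with t declared as string) is to keep the raw column as string and normalize it at query time through a view:
CREATE VIEW mytime_ts AS
SELECT id,
       CAST(if(length(t) = 10, concat(t, ' 00:00:00'), t) AS timestamp) AS t
FROM mytime_str;  -- hypothetical string-typed version of the mytime table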

How do I Insert data from text table (using MultiDelimitSerDe) to Avro Table?

I noticed that I can use an INSERT INTO statement from a text table to an Avro table when not using the MultiDelimitSerDe. It also works with ROW FORMAT DELIMITED FIELDS TERMINATED BY ",", i.e. a single character.
I create two tables, one text table and one Avro table:
CREATE TABLE example1 (example STRING, example2 STRING, example3 STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES ("field.delim"="**")
STORED AS TEXTFILE;
CREATE TABLE example2 (example STRING, example2 STRING, example3 STRING)
STORED AS AVRO;
I then load data into the example1 table (the file is delimited by "**"), i.e.
LOAD DATA INPATH 'HDFS-path' INTO TABLE example1;
example1 now has data inside it. I want to insert the data from example1 to example2.
INSERT INTO TABLE example2 SELECT * from example1;
This, however, gives a "return code 2" error. I have no idea why I am unable to insert the data when using the MultiDelimitSerDe, while it works with "ROW FORMAT DELIMITED FIELDS TERMINATED BY". But I need to use a multi-character delimiter.
Could anyone help me please?
Have you added the required JAR file? 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe' is part of the Hive contrib library, so make sure the corresponding JAR (hive-contrib.jar) is available to your session.
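A minimal sketch of registering it for the current session (the path is an assumption and varies by distribution):
-- path is an assumption; point it at wherever hive-contrib ships on your cluster
ADD JAR /usr/lib/hive/lib/hive-contrib.jar;
-- then retry the insert
INSERT INTO TABLE example2 SELECT * FROM example1;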
