How to specify custom string for NULL values in Hive table stored as text? - hadoop

When a Hive table is stored in text format, for example this table:
CREATE EXTERNAL TABLE clustered_item_info
(
country_id int,
item_id string,
productgroup int,
category string
)
PARTITIONED BY (cluster_id int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '${hivevar:table_location}';
Fields with null values are represented as '\N' strings, and for numeric columns NaNs are represented as 'NaN' strings.
Does Hive provide a way to specify a custom string to represent these special values?
I would like to use empty strings instead of '\N' and 0 instead of 'NaN'. I know this substitution can be done with streaming, but is there any way to do it cleanly in Hive without writing extra code?
Other info:
I'm using Hive 0.8, if that matters.

Set this property when creating the table:
CREATE TABLE IF NOT EXISTS abc
(
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
TBLPROPERTIES ("serialization.null.format"="")
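Applied to the table from the question, this might look like the sketch below. It assumes the property is honored by the default text SerDe on your Hive version (not verified for 0.8). It only changes how NULL is written and read; it does not affect 'NaN'. The ALTER TABLE form for an existing table is likewise something to test first, and on a partitioned table, partitions created before the change may keep their old settings.

```sql
-- Sketch: the question's table, with NULLs serialized as empty strings.
CREATE EXTERNAL TABLE clustered_item_info
(
  country_id int,
  item_id string,
  productgroup int,
  category string
)
PARTITIONED BY (cluster_id int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '${hivevar:table_location}'
TBLPROPERTIES ("serialization.null.format"="");

-- For a table that already exists, the same property can be set afterwards.
ALTER TABLE clustered_item_info
SET SERDEPROPERTIES ("serialization.null.format"="");
```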

Sorry, I misread your question.
If you want an empty string to be written instead of '\N', you can use the COALESCE function:
INSERT OVERWRITE DIRECTORY 's3://bucket/result/'
SELECT NULL, COALESCE(NULL,"")
FROM data_table;
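With real columns from the question's table, the same idea might be written as below. This is a sketch: numeric columns are cast to STRING first so that an empty string is a valid fallback, and the output path is simply the one used in the answer above.

```sql
-- Replace NULLs with empty strings column by column before writing the files.
INSERT OVERWRITE DIRECTORY 's3://bucket/result/'
SELECT
  COALESCE(CAST(country_id AS STRING), ''),
  COALESCE(item_id, ''),
  COALESCE(CAST(productgroup AS STRING), ''),
  COALESCE(category, '')
FROM clustered_item_info;
```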

Related

HIVE - create external tables where string itself contains commas

I am new to Hive and am creating external tables on CSV files. One of the issues I am coming across is values that contain commas within the string itself. For example, the CSV file contains the following:
[Image: CSV file]
When I create an external table in Hive, because there are commas within the "name" column, the first name is shifted to the right into an additional column. This throws all of the data off when you view the table in Hive.
[Image: external table result in Hive]
Is there anything I can add to my script to keep the commas but also keep first and last name in the same column when the external table is created? Thank you all in advance - I am very new to Hive.
CREATE EXTERNAL TABLE database.table name (
ID INT,
Name String,
City String,
State String
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/xyz/xyz/database/directory/'
TBLPROPERTIES ("skip.header.line.count"="1");
Check this solution. You need to add this line: ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
https://community.cloudera.com/t5/Support-Questions/comma-in-between-data-of-csv-mapped-to-external-table-in/td-p/220193
Complete DDL example:
create table hcc(field1 string,
field2 string,
field3 string,
field4 string,
field5 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\"");

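Adapted to the table in the question, a sketch could look like this. The table name db.people_csv is a placeholder, and every column is declared STRING because, as far as I know, OpenCSVSerde reads all columns as strings regardless of the declared type.

```sql
-- Hypothetical table name; location and header setting taken from the question.
CREATE EXTERNAL TABLE db.people_csv (
  ID STRING,
  Name STRING,
  City STRING,
  State STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar" = "\""
)
STORED AS TEXTFILE
LOCATION '/xyz/xyz/database/directory/'
TBLPROPERTIES ("skip.header.line.count"="1");
```

This only helps if the names that contain commas are wrapped in double quotes in the file.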
How to make hive table match data using column names and not using ordinal positions

If I have a csv like -
colName1,colName2
col1Value,col2Value
and a hive ddl like -
CREATE EXTERNAL TABLE tableName (
col2 STRING,
col1 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'hdfs://location/to/testcsv/directory'
tblproperties ("skip.header.line.count"="1");
-- select col2 from tableName; gives col1Value
This is obviously because, for text files, Hive matches columns to data fields by ordinal position. If the underlying file is Parquet, the match is done using column names.
I was wondering if there is a Hive SerDe someone has written, or maybe some SerDe property I am missing, that tells Hive to map data field names to Hive table column names, so that in the above example it would return "col2Value" when col2 is queried, even though the ordinal position of col2 in the Hive table and in the data file does not match.
Thanks in advance!

remove surrounding quotes from fields while loading data into hive

I want to load a table with input data into hive. I have data in the following format.
"153662";"0002241447";"0"
"153662";"000647036X";"0"
"153662";"0020434901";"0"
"153662";"0020973403";"0"
"153662";"0028604202";"0"
"153662";"0030437512";"0"
I want to load this data into a table with two varchar columns and one int column, but the surrounding double quotes are causing trouble. I have created the following table:
CREATE EXTERNAL TABLE Table(A varchar(50),B varchar(50),C varchar(50))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\;'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
But the quotes around the fields also become part of the field values, as shown below.
"276725" "034545104X" "0"
"276726" "0155061224" "5"
I want to ignore them. I also want the third field to be read as INT. Currently it becomes NULL when I declare the third field as INT in the table definition.
You will have to use a CSV SerDe for this.
CREATE TABLE Table(A varchar(50),B varchar(50),C varchar(50))
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES
(
"separatorChar" = ";",
"quoteChar" = "\""
)
STORED AS TEXTFILE;
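Since the question also asks for the third field as INT, one option is to keep the SerDe table as strings and put a typed view on top. A sketch, assuming (to my knowledge) that OpenCSVSerde returns every column as a string; quoted_source and typed_source are hypothetical names, and 'xyz' is a placeholder location.

```sql
-- Same SerDe settings as in the answer above, under a hypothetical table name.
CREATE EXTERNAL TABLE quoted_source (a STRING, b STRING, c STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ";",
  "quoteChar" = "\""
)
STORED AS TEXTFILE
LOCATION 'xyz';

-- The quotes are already stripped by the SerDe; only the type conversion remains.
CREATE VIEW typed_source AS
SELECT a, b, CAST(c AS INT) AS c
FROM quoted_source;
```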
Multiple ways to achieve this:
1. Use the CSV SerDe.
2. Use the regex SerDe with the regex "\"(.*)\"\;\"(.*)\"\;\"(.*)\"".
3. Load the data into an external table, then remove the double quotes:
CREATE EXTERNAL TABLE source(
a string,
b string,
c string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;' LOCATION 'xyz';
CREATE TABLE destination AS
SELECT REGEXP_REPLACE(a,'"','') AS a,
REGEXP_REPLACE(b,'"','') AS b,
CAST(REGEXP_REPLACE(c,'"','') AS BIGINT) AS c
FROM source;
Hive query to remove double quotes around the string.
Example:
col2 value: "my name is, abc"
select col1, (regexp_replace(col2,'"','')) as col2 from table;
Output: my name is, abc

How to load CSV data with enclosed by double quotes and separated by tab into HIVE table?

I am trying to load data from a CSV file in which the values are enclosed in double quotes ('"') and separated by tabs ('\t').
When I load it into Hive, no error is thrown and the load completes, but I think all the data ends up in a single column, and most of the values show as NULL.
Below is my CREATE TABLE statement.
CREATE TABLE example
(
organization STRING,
order BIGINT,
created_on TIMESTAMP,
issue_date TIMESTAMP,
qty INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
ESCAPED BY '"'
STORED AS TEXTFILE;
Input file sample:
"Organization" "Order" "Created on" "issue_date" "qty"
"GB" "111223" "2015/02/06 00:00:00" "2015/05/15 00:00:00" "5"
"UK" "1110" "2015/05/06 00:00:00" "2015/06/1 00:00:00" "51"
And the LOAD statement to push the data into the Hive table:
LOAD DATA INPATH '/user/example.csv' OVERWRITE INTO TABLE example
What could be the issue, and how can I ignore the header of the file?
If I remove ESCAPED BY '"' from the create statement, the data loads into the respective columns, but all the values are enclosed in double quotes.
How can I remove the double quotes from the values and ignore the header of the file?
You can now use OpenCSVSerde, which lets you define the separator character and strips the surrounding double quotes:
CREATE EXTERNAL TABLE example (
organization STRING,
`order` BIGINT, -- order is a reserved word, so it is quoted with backticks
created_on TIMESTAMP,
issue_date TIMESTAMP,
qty INT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "\t",
"quoteChar" = "\""
)
LOCATION '/your/folder/location/';
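The DDL above does not address the header row, and to my knowledge OpenCSVSerde returns every column as a string whatever type is declared. A sketch of one way to handle both, assuming Hive 0.13 or later for skip.header.line.count. The alias order_id is made up for illustration, and the timestamp columns are left as strings because the sample's slash-separated dates would need explicit parsing.

```sql
-- Skip the first line of each file (the header row).
ALTER TABLE example SET TBLPROPERTIES ("skip.header.line.count"="1");

-- Cast the numeric columns at query time.
SELECT
  organization,
  CAST(`order` AS BIGINT) AS order_id,
  created_on,
  issue_date,
  CAST(qty AS INT) AS qty
FROM example;
```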
You don't want to use escaped by, that's for escape characters, not quote characters. I don't think that Hive actually has support for quote characters. You might want to take a look at this csv serde which accepts a quotechar property.
Also if you have HUE, you can use the metastore manager webapp to load the CSV in, this will deal with the header row, column datatypes and so on.
Use a CSV SerDe to create the table. I've created a table in Hive as follows, and it works like a charm.
CREATE EXTERNAL TABLE IF NOT EXISTS myTable (
id STRING,
url STRING,
name STRING
)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties ("separatorChar" = "\t")
LOCATION '<folder location>';
"Hive now includes an OpenCSVSerde which will properly parse those quoted fields without adding additional jars or error prone and slow regex."
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
Source: Ben Doerr, in "How to handle fields enclosed within quotes (CSV) in importing data from S3 into DynamoDB using EMR/Hive"
You can use a CSV SerDe, "csv-serde-1.1.2.jar", to load the file without the double quotes.
Download link:
http://ogrodnek.github.io/csv-serde/
The create table statement is:
CREATE TABLE <table_name> (col_name_1 type1, col_name_2 type2, ...) row format serde 'com.bizo.hive.serde.csv.CSVSerde';
You can skip the header row with the following property in the create table statement:
tblproperties ("skip.header.line.count"="1");

Collecting data from file using Hive

I have my data in this format:
"123";"mybook1";"2002";"publisher1";
"456";"mybook2;the best seller";"2004";"publisher2";
"789";"mybook3";"2002";"publisher1";
The fields are enclosed in "" and are delimited by ;
The book name may also contain ';'.
Can you tell me how to load this data from a file into a Hive table?
The query I am using now, shown below, is obviously not working:
create table books (isbn string,title string,year string,publisher string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;'
If possible, I want the userid and year fields to be stored as Int.
Please help
Thanks,
Harish
The thing you are missing is RegexSerDe. It is very helpful for extracting only part of the input text. Your DDL would look like this:
create table books ( isbn string, title string, year string, publisher string )
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(?:\")(\\d*)(?:\"\;\")([^\"]*)(?:\"\;\")(\\d*)(?:\"\;\")([^\"]*)\"(?:\;)" ,
"output.format.string" = "%1$s %2$s %3$s %4$s"
)
STORED AS TEXTFILE;
The regex may look complex at first sight because of the escaping and the non-capturing groups. It actually contains two capturing groups, (\d*) and ([^"]*), placed alternately two times. The non-capturing groups (?:...) just help to skip the surrounding quotes and delimiters. The group ([^"]*) also takes care of a ';' inside the book name field.
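To see what the pattern captures, it can be tried on one sample row with regexp_extract. This is only a sketch: the non-capturing groups are dropped since they do not change what is matched, semicolons are written as \; as elsewhere in this thread, and a SELECT without FROM assumes a reasonably recent Hive version.

```sql
-- Group 2 is the title; [^"]* lets it contain the embedded semicolon.
SELECT regexp_extract(
  '"456"\;"mybook2\;the best seller"\;"2004"\;"publisher2"\;',
  '"(\\d*)"\;"([^"]*)"\;"(\\d*)"\;"([^"]*)"\;',
  2
) AS title;
```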
But nothing comes without a cost. Despite all of its features, RegexSerDe supports only string fields. All you can do is call the built-in Hive cast function to do the conversion when selecting the data from the table, e.g. (actual syntax may vary a bit):
SELECT cast( year as int ) from books;
Hope this helps.
