Empty String is not treated as null in Hive - hadoop

My understanding of the following statement is that if a blank or empty string is inserted into a Hive column, it will be treated as NULL.
TBLPROPERTIES('serialization.null.format'='')
To test this, I created a table and inserted '' into field3. When I query for NULLs in field3, no rows match.
Is my understanding that a blank string becomes NULL correct?
CREATE TABLE CDR
(
field1 string,
field2 string,
field3 string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
TBLPROPERTIES('serialization.null.format'='');
insert overwrite table emmtest.cdr select field1,field2,'' from emmtest.cdr_non_orc;
select * from emmtest.cdr where field3 is null;
The last statement did not return any rows, but I expected all rows to be returned, since field3 holds a blank string.

TBLPROPERTIES('serialization.null.format'='') means the following:
An empty field in the data files is treated as NULL when you query the table.
When rows are inserted into the table, NULL values are written to the data files as empty fields.
You are doing something else:
You are inserting an empty string into the table from a query.
It is treated "as is", as an empty string.
Demo
bash
hdfs dfs -mkdir /user/hive/warehouse/mytable
echo Hello,,World | hdfs dfs -put - /user/hive/warehouse/mytable/data.txt
hive
create table mytable (s1 string,s2 string,s3 string)
row format delimited
fields terminated by ','
;
hive> select * from mytable;
OK
s1 s2 s3
Hello World
hive> alter table mytable set tblproperties ('serialization.null.format'='');
OK
hive> select * from mytable;
OK
s1 s2 s3
Hello NULL World
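The read-side effect shown in the demo can be modelled with a short Python sketch. This is only an illustration of the idea, not Hive's actual SerDe code; read_row and write_row are hypothetical helper names, and the only Hive fact assumed is that the default NULL marker in text files is \N.

```python
DEFAULT_NULL_FORMAT = r"\N"  # Hive's default marker for NULL in text files


def read_row(line, delimiter=",", null_format=DEFAULT_NULL_FORMAT):
    """Split one text line into fields; a field equal to null_format reads as None (NULL)."""
    return [None if field == null_format else field
            for field in line.split(delimiter)]


def write_row(values, delimiter=",", null_format=DEFAULT_NULL_FORMAT):
    """Serialise one row; None (NULL) is written out as null_format."""
    return delimiter.join(null_format if v is None else v for v in values)


line = "Hello,,World"
print(read_row(line))                   # default marker: ['Hello', '', 'World']
print(read_row(line, null_format=""))   # empty marker:   ['Hello', None, 'World']
print(write_row(["Hello", None, "World"], null_format=""))  # Hello,,World
```

The same bytes on disk produce different query results depending only on the marker the reader is told to look for, which is what the ALTER TABLE in the demo changes.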

You can also use the following clause in the table's ROW FORMAT DELIMITED definition:
NULL DEFINED AS ''
The quotes can hold any other string you want to be read and written as NULL.

Related

Insert overwrite directory using Presto like Hive

In Hive, the statement below will output foo^Bbar^Abaz
insert overwrite directory 's3://bucket-name/foobarbaz'
row format delimited
fields terminated by '\001'
select split('foo,bar', ','), 'baz';
In Presto, I ran this statement:
insert overwrite directory 's3://bucket-name/foobarbaz'
select split('foo,bar', ','), 'baz';
With this result: ["foo","bar"]^Abaz
What is the equivalent Presto clause for insert overwrite directory that works for arrays and structs?
It seems like Presto converted my array type into a json string, but I want this formatted to Hadoop spec with collection item and map key delimiter support.
Try to specify COLLECTION ITEMS TERMINATED BY in the create table DDL.
row_format DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char] ...
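To make the target layout concrete, here is a small Python sketch of the nested delimiters described in the question: ^A between top-level fields and ^B between array items. It only models the expected output string; serialize_row and visible are hypothetical helpers, not part of Hive or Presto.

```python
FIELD_DELIM = "\x01"       # ^A, between top-level fields
COLLECTION_DELIM = "\x02"  # ^B, between items of an array


def serialize_row(values):
    """Join array items with ^B, then join the resulting fields with ^A."""
    parts = []
    for value in values:
        if isinstance(value, (list, tuple)):
            parts.append(COLLECTION_DELIM.join(str(item) for item in value))
        else:
            parts.append(str(value))
    return FIELD_DELIM.join(parts)


def visible(text):
    """Render the control characters in caret notation for printing."""
    return text.replace("\x01", "^A").replace("\x02", "^B")


row = ["foo,bar".split(","), "baz"]
print(visible(serialize_row(row)))  # foo^Bbar^Abaz
```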

remove surrounding quotes from fields while loading data into hive

I want to load a table with input data into hive. I have data in the following format.
"153662";"0002241447";"0"
"153662";"000647036X";"0"
"153662";"0020434901";"0"
"153662";"0020973403";"0"
"153662";"0028604202";"0"
"153662";"0030437512";"0"
I want to load this data into a table with two varchar columns and one int column, but the surrounding double quotes are causing trouble. I have created the following table.
CREATE EXTERNAL TABLE Table(A varchar(50),B varchar(50),C varchar(50))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\;'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
but the quotes around each field also become part of the field, as shown below.
"276725" "034545104X" "0"
"276726" "0155061224" "5"
I want to ignore them. Also I want the third field to be read as INT. Currently it becomes NULL when I provide third field as INT while making table.
You will have to use Csv-Serde for this.
CREATE TABLE Table(A varchar(50),B varchar(50),C varchar(50))
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES
(
"separatorChar" = ";",
"quoteChar" = "\""
)
STORED AS TEXTFILE;
There are multiple ways to achieve this:
Use the CSV SerDe.
Use the regex SerDe with the regex "\"(.*)\"\;\"(.*)\"\;\"(.*)\""
Load the data into an external table, then remove the double quotes:
CREATE EXTERNAL TABLE source(
a string,
b String,
c String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;' LOCATION 'xyz';
CREATE TABLE destination AS
SELECT REGEXP_REPLACE(a,'"','') AS a,
REGEXP_REPLACE(b,'"','') AS b,
CAST(REGEXP_REPLACE(c,'"','') AS BIGINT) AS c
FROM source;
Hive query to remove double quotes around the string.
Example:
col2 value: "my name is, abc"
select col1, (regexp_replace(col2,'"','')) as col2 from table;
Output: my name is, abc
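The quote-aware parsing that the answers above rely on can be sketched in Python with the standard csv module. This is an illustration of the parsing rule only (separator ';', quote character '"'), not the SerDe itself; parse_rows is a hypothetical helper, and the cast of the third field to int is done explicitly as a separate step.

```python
import csv
import io

# Two rows copied from the sample data in the question.
SAMPLE = '"153662";"0002241447";"0"\n"153662";"000647036X";"0"\n'


def parse_rows(text):
    """Split on ';', drop the surrounding double quotes, cast the third field to int."""
    reader = csv.reader(io.StringIO(text), delimiter=";", quotechar='"')
    return [(a, b, int(c)) for a, b, c in reader]


for row in parse_rows(SAMPLE):
    print(row)
```

The leading zeros in the second field survive because that field stays a string; only the third field is converted.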

Excluding the partition field from select queries in Hive

Suppose I have the following table definition in Hive (the actual table has around 65 columns):
CREATE EXTERNAL TABLE S.TEST (
COL1 STRING,
COL2 STRING
)
PARTITIONED BY (extract_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\007'
LOCATION 'xxx';
Once the table is created, when I run hive -e "describe s.test", I see extract_date listed as one of the columns of the table. Running select * from s.test also returns the extract_date column values. Is it possible to exclude this virtual(?) column when running select queries in Hive?
Change this property
set hive.support.quoted.identifiers=none;
and run the query as
SELECT `(extract_date)?+.+` FROM <table_name>;
I tested it and it works fine.
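How such a column regex filters names can be sketched in Python. Two assumptions are made here: that the pattern must match the whole column name, case-insensitively, and that a negative lookahead is an acceptable stand-in for the possessive quantifier in `(extract_date)?+.+`, which older Python versions do not support. select_columns is a hypothetical helper, not a Hive function.

```python
import re


def select_columns(columns, excluded="extract_date"):
    """Keep every column whose full name is not exactly the excluded one."""
    # (?!name\Z).+ accepts any non-empty name except exactly `excluded`.
    pattern = re.compile(r"(?!%s\Z).+" % re.escape(excluded), re.IGNORECASE)
    return [name for name in columns if pattern.fullmatch(name)]


print(select_columns(["col1", "col2", "extract_date"]))  # ['col1', 'col2']
```

A column such as extract_date2 is still selected, because only an exact match of the whole name is rejected.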

How to specify custom string for NULL values in Hive table stored as text?

When storing Hive table in text format, for example this table:
CREATE EXTERNAL TABLE clustered_item_info
(
country_id int,
item_id string,
productgroup int,
category string
)
PARTITIONED BY (cluster_id int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '${hivevar:table_location}';
Fields with null values are represented as '\N' strings, also for numbers NaNs are represented as 'NaN' strings.
Does Hive provide a way to specify custom string to represent these special values?
I would like to use empty strings instead of '\N' and 0 instead of 'NaN'. I know this substitution can be done with streaming, but is there any way to do it cleanly in Hive without writing extra code?
Other info:
I'm using Hive 0.8 if that matters...
Use this property while creating the table
CREATE TABLE IF NOT EXISTS abc
(
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
TBLPROPERTIES ("serialization.null.format"="")
Sorry, I did not read your question carefully.
If you want an empty string to be written instead of '\N', you can use the COALESCE function:
INSERT OVERWRITE DIRECTORY 's3://bucket/result/'
SELECT NULL, COALESCE(NULL,"")
FROM data_table;

getting null values while loading the data from flat files into hive tables

I am getting the null values while loading the data from flat files into hive tables.
my tables structure is like this:
hive> create table test_hive (id int,value string);
and my flat file is like this:
input.txt
1 a
2 b
3 c
4 d
5 e
6 F
7 G
8 j
when I am running the below commands I am getting null values:
hive> LOAD DATA LOCAL INPATH '/home/hduser/input.txt' OVERWRITE INTO TABLE test_hive;
hive> select * from test_hive;
OK
NULL NULL
NULL NULL
NULL NULL
NULL NULL
NULL NULL
NULL NULL
NULL NULL
NULL NULL
screen shot:
hive> create table test_hive (id int,value string);
OK
Time taken: 4.97 seconds
hive> show tables;
OK
test_hive
Time taken: 0.124 seconds
hive> LOAD DATA LOCAL INPATH '/home/hduser/input2.txt' OVERWRITE INTO TABLE test_hive;
Copying data from file:/home/hduser/input2.txt
Copying file: file:/home/hduser/input2.txt
Loading data to table default.test_hive
Deleted hdfs://hydhtc227141d:54310/app/hive/warehouse/test_hive
OK
Time taken: 0.572 seconds
hive> select * from test_hive;
OK
NULL NULL
NULL NULL
NULL NULL
NULL NULL
NULL NULL
NULL NULL
NULL NULL
NULL NULL
Time taken: 0.182 seconds
The default field terminator in Hive is ^A. You need to explicitly mention in your create table statement that you are using a different field separator.
Similar to what Lorand Bending pointed in the comment, use:
CREATE TABLE test_hive(id INT, value STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';
You don't need to specify a location since you are creating a managed table (and not an external table).
The problem you are facing is that the fields in your data are separated by ' ', but you did not specify a field delimiter when creating the table. If you do not specify one, Hive uses ^A as the delimiter by default.
To resolve the problem, recreate the table with the syntax below and it should work.
CREATE TABLE test_hive(id INT, value STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';
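Why every column comes back as NULL can be modelled with a short Python sketch. It is a simplification of what a delimited text reader does, under the assumption that a value which cannot be cast to the column type, or a field that is missing, reads as NULL; parse_line is a hypothetical helper.

```python
def parse_line(line, delimiter, types):
    """Split a line and cast each field; unparseable or missing fields become None (NULL)."""
    fields = line.split(delimiter)
    row = []
    for index, cast in enumerate(types):
        if index >= len(fields):
            row.append(None)  # the line has fewer fields than the table has columns
            continue
        try:
            row.append(cast(fields[index]))
        except ValueError:
            row.append(None)  # e.g. "1 a" is not a valid int
    return row


SCHEMA = (int, str)  # id int, value string
print(parse_line("1 a", "\x01", SCHEMA))  # default ^A delimiter: [None, None]
print(parse_line("1 a", " ", SCHEMA))     # space delimiter:      [1, 'a']
```

With ^A as the delimiter the whole line "1 a" lands in the first column, fails the int cast, and leaves the second column with no data at all.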
The solution is quite simple: the table wasn't created correctly.
The way to avoid this and similar problems is to know how the data is laid out before loading it.
CREATE TABLE [IF NOT EXISTS] mytableName(id int,value string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE ;
Now let me explain the code:
First line
Creates your table. The [IF NOT EXISTS] part is optional; it tells Hive not to overwrite the table if it already exists. It is more of a safety measure.
Second line
Specifies that delimiters are defined at the table level for structured fields.
Third line
You can use any single character, but the default is '\001'.
'\t' is for a tab, which applies in your case
'|' is for fields separated by |
' ' is for a single space, and so on.
Fourth line
Specifies the type of file in which the data is stored. The file can be a TEXTFILE, SEQUENCEFILE, RCFILE, or BINARY SEQUENCEFILE. Alternatively, how the data is stored can be specified as Java input and output classes.
When loading from the local file system:
LOAD DATA LOCAL INPATH '/your/data/path.csv' [OVERWRITE] INTO TABLE myTableName;
Always check your data afterwards with a simple select * statement.
Hope it helps.
Hive’s default record and field delimiters list:
\n
^A
^B
^C
In Vim, pressing ^V^A inserts a ^A.
Are the elements separated by a space or a tab? If it is a tab, follow these steps; if it is a space, use ' ' instead of '\t'.
hive> CREATE TABLE test_hive(id INT, value STRING) row format
delimited fields terminated by '\t' lines terminated by '\n' stored as textfile;
Then enter:
hive> LOAD DATA LOCAL INPATH '/home/hduser/input.txt' OVERWRITE INTO TABLE test_hive;
hive> select * from test_hive;
Now you should get your expected output.
Please check the date column in the dataset; it should follow the date format yyyy-mm-dd.
If the string is in the form 'yyyy-mm-dd', then a date value corresponding to that year/month/day is returned. If the string value does not match this format, then NULL is returned.
Hive Official documentation
