Hive load CSV with commas in quoted fields - hadoop

I am trying to load a CSV file into a Hive table like so:
CREATE TABLE mytable
(
num1 INT,
text1 STRING,
num2 INT,
text2 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";
LOAD DATA LOCAL INPATH '/data.csv'
OVERWRITE INTO TABLE mytable;
The CSV is delimited by a comma (,) and looks like this:
1, "some text, with comma in it", 123, "more text"
This will return corrupt data since there is a ',' in the first string.
Is there a way to set a text delimiter or make Hive ignore the ',' in strings?
I can't change the delimiter of the csv since it gets pulled from an external source.

If you can re-create or parse your input data, you can specify an escape character for the CREATE TABLE:
ROW FORMAT DELIMITED FIELDS TERMINATED BY "," ESCAPED BY '\\';
This will accept the following line as 4 fields:
1,some text\, with comma in it,123,more text

The problem is that Hive doesn't handle quoted text. You either need to pre-process the data by changing the delimiter between the fields (e.g. with a Hadoop streaming job), or you can try a custom CSV SerDe, which uses OpenCSV to parse the files.

As of Hive 0.14, the CSV SerDe is a standard part of the Hive install:
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
(See: https://cwiki.apache.org/confluence/display/Hive/CSV+Serde)
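A minimal sketch of how the table from the question might be declared with this SerDe. The property values are the SerDe's documented defaults (comma separator, double-quote quote character, backslash escape) and are spelled out only for clarity. One caveat: OpenCSVSerde exposes every column as STRING regardless of the declared type, so the columns are declared as STRING here and the numbers are cast at query time. Whether the space between a comma and an opening quote in the sample line is tolerated is something to verify against the real data.

```sql
-- Sketch: same columns as the question's table, parsed with OpenCSVSerde.
-- All columns are STRING because the SerDe returns strings only.
CREATE TABLE mytable
(
  num1  STRING,
  text1 STRING,
  num2  STRING,
  text2 STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar"     = "\"",
  "escapeChar"    = "\\"
)
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/data.csv'
OVERWRITE INTO TABLE mytable;

-- Cast back to numbers when querying; TRIM guards against stray spaces
-- such as the ones after the commas in the sample line.
SELECT CAST(TRIM(num1) AS INT) AS num1,
       text1,
       CAST(TRIM(num2) AS INT) AS num2,
       text2
FROM mytable;
```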

Keep the delimiter in single quotes and it will work.
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
This will work

Add a backslash in FIELDS TERMINATED BY '\;'.
For Example:
CREATE TABLE demo_table_1_csv
COMMENT 'my_csv_table 1'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\;'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 'your_hdfs_path'
AS
select a.tran_uuid,a.cust_id,a.risk_flag,a.lookback_start_date,a.lookback_end_date,b.scn_name,b.alerted_risk_category,
CASE WHEN (b.activity_id is not null ) THEN 1 ELSE 0 END as Alert_Flag
FROM scn1_rcc1_agg as a LEFT OUTER JOIN scenario_activity_alert as b ON a.tran_uuid = b.activity_id;
I have tested it, and it worked.

The org.apache.hadoop.hive.serde2.OpenCSVSerde SerDe worked for me. My delimiter was '|' and one of the columns is enclosed in double quotes.
Query:
CREATE EXTERNAL TABLE EMAIL(MESSAGE_ID STRING, TEXT STRING, TO_ADDRS STRING, FROM_ADDRS STRING, SUBJECT STRING, DATE STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "|",
"quoteChar" = "\"",
"escapeChar" = "\""
)
STORED AS TEXTFILE LOCATION '/user/abc/csv_folder';

Related

Remove surrounding quotes from fields while loading data into Hive

I want to load a table with input data into hive. I have data in the following format.
"153662";"0002241447";"0"
"153662";"000647036X";"0"
"153662";"0020434901";"0"
"153662";"0020973403";"0"
"153662";"0028604202";"0"
"153662";"0030437512";"0"
I want to load this data into a table with two varchar columns and one int column, but the surrounding double quotes are causing trouble. I have created the following table:
CREATE EXTERNAL TABLE Table(A varchar(50),B varchar(50),C varchar(50))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\;'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
But the quotes around the fields also become part of the field values, as shown below:
"276725" "034545104X" "0"
"276726" "0155061224" "5"
I want to ignore them. I also want the third field to be read as INT; currently it becomes NULL when I declare the third field as INT in the table definition.
You will have to use the CSV SerDe for this.
CREATE TABLE Table(A varchar(50),B varchar(50),C varchar(50))
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES
(
"separatorChar" = ";",
"quoteChar" = "\""
)
STORED AS TEXTFILE;
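Note that OpenCSVSerde reads every column as STRING, so it removes the quotes but does not by itself produce an INT third column. A hedged sketch of one way to get there is to keep the SerDe table as a raw layer and put a casting view on top. The names quoted_raw and quoted_typed and the location path are made up for illustration; the separator is written as "\;" because some Hive clients treat a bare semicolon as the end of the statement.

```sql
-- Hypothetical raw table: the SerDe strips the quotes, all columns are STRING.
CREATE EXTERNAL TABLE quoted_raw (a STRING, b STRING, c STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = "\;",
  "quoteChar"     = "\""
)
STORED AS TEXTFILE
LOCATION '/path/to/data';  -- placeholder path

-- Typed view: two varchar columns and one int column, as the question asks.
CREATE VIEW quoted_typed AS
SELECT CAST(a AS VARCHAR(50)) AS a,
       CAST(b AS VARCHAR(50)) AS b,
       CAST(c AS INT)         AS c
FROM quoted_raw;
```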
Multiple ways to achieve this:
Use CSV serde
Use the regex SerDe with the regex "\"(.*)\"\;\"(.*)\"\;\"(.*)\""
Load data to external table then remove double quotes:
CREATE EXTERNAL TABLE source(
a string,
b String,
c String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;' LOCATION 'xyz';
CREATE TABLE destination AS SELECT REGEXP_REPLACE(a,'"','') AS a, REGEXP_REPLACE(b,'"','') AS b, CAST(REGEXP_REPLACE(c,'"','') AS BIGINT) AS c FROM source;
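The regex SerDe option from the list above could look roughly like the sketch below. The table name source_regex is hypothetical, and the pattern uses [^"]* instead of .* so that a capture group cannot run past a closing quote (a small change from the regex given above). Each capture group maps to one column in order, the whole line has to match the pattern, and columns are kept as STRING here.

```sql
-- Sketch: one capture group per column; quotes sit outside the groups,
-- so they are dropped from the column values.
CREATE EXTERNAL TABLE source_regex (
  a STRING,
  b STRING,
  c STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "\"([^\"]*)\"\;\"([^\"]*)\"\;\"([^\"]*)\""
)
STORED AS TEXTFILE
LOCATION 'xyz';
```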
Hive query to remove double quotes around the string.
Example:
col2 value: "my name is, abc"
select col1, (regexp_replace(col2,'"','')) as col2 from table;
Output: my name is, abc

Hive external table on data containing newline

I have a few txt files on which I want to create an external table.
Unfortunately, the content of the files also contains the string "\n" from time to time. It seems that Hive interprets this as a newline, even though it's not a newline in the original file and is just part of the text.
Can I catch this problem in Hive without having to alter the original txt files?
You can put some other delimiter at the end of each line (anything other than \n and your field separator), and then register that delimiter in the table definition.
E.g., let's say I have a record like this:
1,2,3,aniit\n,4\n
In this record, aniit\n is a string and \n is a string, so Hive splits it into two records. To avoid this, you can add another delimiter at the end, like:
1,2,3,aniit\n,4\n||
Here '||' is the line delimiter, and my create table will look like:
create external table if not exists table1
(
col1 int,
col2 int,
col3 int,
col4 string,
col5 string
)
row format delimited fields terminated by ','
lines terminated by '||'
stored as textfile
location '/tmp/table1';
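Some Hive versions reject any LINES TERMINATED BY value other than '\n' (see the "LINES TERMINATED BY only supports newline" error quoted later on this page). A sketch of a workaround that is sometimes used in that case is the Hadoop property textinputformat.record.delimiter, which tells the text input format where a record ends. This assumes the files really do end every record with '||'; the table name table1_alt is hypothetical, and the setting is session-wide, so it affects every text table read in that session and has to be set again in each session that queries the table.

```sql
-- Assumption: every record in the files ends with '||'.
SET textinputformat.record.delimiter=||;

-- Same columns as above, but without a LINES TERMINATED BY clause.
CREATE EXTERNAL TABLE IF NOT EXISTS table1_alt
(
  col1 INT,
  col2 INT,
  col3 INT,
  col4 STRING,
  col5 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/tmp/table1';
```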

How to load CSV data with enclosed by double quotes and separated by tab into HIVE table?

I am trying to load data from a CSV file in which the values are enclosed in double quotes ('"') and separated by tabs ('\t').
When I try to load it into Hive, no error is thrown and the data loads, but I think all the data is getting loaded into a single column, and most of the values show as NULL.
Below is my create table statement:
CREATE TABLE example
(
organization STRING,
order BIGINT,
created_on TIMESTAMP,
issue_date TIMESTAMP,
qty INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
ESCAPED BY '"'
STORED AS TEXTFILE;
Input file sample:
"Organization" "Order" "Created on" "issue_date" "qty"
"GB" "111223" "2015/02/06 00:00:00" "2015/05/15 00:00:00" "5"
"UK" "1110" "2015/05/06 00:00:00" "2015/06/1 00:00:00" "51"
And the load statement to push data into the Hive table:
LOAD DATA INPATH '/user/example.csv' OVERWRITE INTO TABLE example
What could be the issue, and how can I ignore the header of the file?
If I remove ESCAPED BY '"' from the create statement, the data loads into the respective columns, but all the values are enclosed in double quotes.
How can I remove the double quotes from the values and ignore the header of the file?
You can now use OpenCSVSerde, which allows you to define the separator character and easily strip the surrounding double quotes:
CREATE EXTERNAL TABLE example (
organization STRING,
order BIGINT,
created_on TIMESTAMP,
issue_date TIMESTAMP,
qty INT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "\t",
"quoteChar" = "\""
)
LOCATION '/your/folder/location/';
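Two details are worth sketching here, under stated assumptions. OpenCSVSerde returns every column as STRING, so typed columns need a cast on top, and the header row can be skipped with the skip.header.line.count table property. The names example_raw, example_typed and order_id are hypothetical, and the timestamp pattern is inferred from the sample rows (yyyy/MM/dd HH:mm:ss); how strictly a value such as "2015/06/1" is parsed may vary by Hive version.

```sql
-- Raw layer: quotes stripped, header skipped, everything is STRING.
CREATE EXTERNAL TABLE example_raw (
  organization STRING,
  `order`      STRING,   -- backticks because ORDER is a reserved word
  created_on   STRING,
  issue_date   STRING,
  qty          STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = "\t",
  "quoteChar"     = "\""
)
STORED AS TEXTFILE
LOCATION '/your/folder/location/'
TBLPROPERTIES ("skip.header.line.count" = "1");

-- Typed layer: parse the slash-separated dates, then cast.
CREATE VIEW example_typed AS
SELECT organization,
       CAST(`order` AS BIGINT) AS order_id,
       CAST(from_unixtime(unix_timestamp(created_on, 'yyyy/MM/dd HH:mm:ss')) AS TIMESTAMP) AS created_on,
       CAST(from_unixtime(unix_timestamp(issue_date, 'yyyy/MM/dd HH:mm:ss')) AS TIMESTAMP) AS issue_date,
       CAST(qty AS INT) AS qty
FROM example_raw;
```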
You don't want to use ESCAPED BY; that's for escape characters, not quote characters. I don't think that Hive actually has support for quote characters. You might want to take a look at this csv serde, which accepts a quotechar property.
Also, if you have HUE, you can use the metastore manager webapp to load the CSV; this will deal with the header row, column datatypes and so on.
Use the CSV SerDe to create the table. I've created a table in Hive as follows, and it works like a charm.
CREATE EXTERNAL TABLE IF NOT EXISTS myTable (
id STRING,
url STRING,
name STRING
)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties ("separatorChar" = "\t")
LOCATION '<folder location>';
"Hive now includes an OpenCSVSerde which will properly parse those quoted fields without adding additional jars or error prone and slow regex."
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
source = Ben Doerr
How to handle fields enclosed within quotes(CSV) in importing data from S3 into DynamoDB using EMR/Hive
You can use a CSV SerDe ("csv-serde-1.1.2.jar") to load the file without the double quotes.
Download link:
http://ogrodnek.github.io/csv-serde/
The create table statement is:
CREATE TABLE <table_name> (col_name_1 type1, col_name_2 type2, ...) row format serde 'com.bizo.hive.serde.csv.CSVSerde';
You can skip the header with the following property in the create table statement:
tblproperties ("skip.header.line.count"="1");

Hive: How to delimit rows using a string literal

Need help here.
This is related to Hive.
I have a text file with a single long line, e.g.:
JASON 29\SASHA 24\CHRISTINE 15\ROBERT 20\
Now I need to create a table in Hive whose rows are delimited by "\" (backslash), so that if I insert the data from the line above ("JASON 29\SASHA 24....") I would get 4 rows inserted in my table.
In other words, I want my custom char to be the row delimiter, and not the default "\n".
I wrote the DDL:
CREATE TABLE newline_tab
(
name STRING,
age INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\\'
STORED AS TEXTFILE;
But I am unable to create the table, and I'm getting the following error:
FAILED: SemanticException 9:20 LINES TERMINATED BY only supports newline '\n' right now. Error encountered near token ''\''
Any help would be appreciated :)
CREATE TABLE IF NOT EXISTS employee ( eid int, name String,
salary String, destination String)
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Why doesn't Hive have FIELDS ENCLOSED BY like in MySQL?

Here is my case.
Input lines:
"vijay" <\t> "a-b-c","a-c-d","a-d-c"
"kumar" <\t> "a-b-c","b-c-d""
I created the table like this:
hive> create table user_infos(name string, path ARRAY<String>) -- I need an array only
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' COLLECTION ITEMS
TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
Output received:
hive > select * from user_infos ;
"vijay" ["**\"a-b-c\"**","**\"a-c-d\"**","**\"a-d-c\"**"]
"kumar" ["**\"a-b-c\"**","**\"b-c-d\"**"]
The problem here is: I don't want the double quotes, i.e., \"
Required output:
vijay ["a-b-c","a-c-d","a-d-c"]
kumar ["a-b-c","b-c-d"]
Is there any way to achieve this without using a custom SerDe? Anything like ENCLOSED BY as in MySQL?
I was also stuck with the same issue, as my fields are enclosed in double quotes and separated by semicolons (;). My table name is employee1.
I searched around and found a solution for this.
@ramisetty.vijay: Yes, we have to use a SerDe for this. Please download the SerDe jar using this link: https://github.com/downloads/IllyaYalovyy/csv-serde/csv-serde-0.9.1.jar
Then follow the steps below at the hive prompt:
add jar path/to/csv-serde.jar;
create table employee1(id string, name string, addr string)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties(
"separatorChar" = "\;",
"quoteChar" = "\"")
stored as textfile
;
Then load data from your path using the query below:
load data local inpath 'path/xyz.csv' into table employee1;
And then run:
select * from employee1;
Thanks.
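For the original ask of getting an array without a custom SerDe, one possible sketch is to read the second field as a plain string and build the array in a query with the built-in regexp_replace and split functions. The staging table name user_infos_raw is hypothetical, and this assumes no path element itself contains a comma or a meaningful double quote.

```sql
-- Hypothetical staging table: the quoted list is read as one plain string.
CREATE TABLE user_infos_raw (name STRING, path STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Strip the quotes, then split on commas to get ARRAY<STRING>.
CREATE TABLE user_infos AS
SELECT regexp_replace(name, '"', '')             AS name,
       split(regexp_replace(path, '"', ''), ',') AS path
FROM user_infos_raw;
```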
